Medical AI Gets a Major Boost from Synthetic Data ✨

Published: 2025/8/22 20:30:58

A new paper shows how to seriously level up medical AI with synthetic (artificially generated) data! So cool!

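💎 Gal-style sparkle points ✨
● They found a way to get both accuracy and fairness (freedom from bias) in medical AI 💖
● Solving the existing data shortage with synthetic data is genius 💎
● It could lower the barrier for medical AI, so a future where all kinds of people benefit might be on the way ✨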

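Time for the detailed breakdown!

Background
Medical imaging AI is super high-performing, but it has had two problems: not enough data, and diagnostic results that differ depending on patient attributes (like age and sex) 😢 This research set out to solve exactly that!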

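Method
The team developed RoentGen-v2, a new AI that creates synthetic chest X-ray (CXR) images! You can specify patient information (like sex and age) in detail, and it generates realistic images. By training AI on synthetic data together with real data, they aimed to boost both accuracy and fairness 😎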
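The paper's abstract describes the training recipe as supervised pretraining on synthetic data followed by fine-tuning on real data, compared against naively mixing the two. Here is a minimal toy sketch of that two-stage idea, not the authors' pipeline: it assumes a tiny logistic regression on made-up 1-D data, and every name in it (`make_data`, `train`, the `shift` parameter, the learning rates) is hypothetical and chosen only for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(weights, x)))

def train(weights, data, epochs, lr):
    """Run plain SGD for logistic regression, starting from `weights`."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            p = predict(w, x)
            # Gradient of the log-loss w.r.t. each weight is (p - y) * x_i.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

def accuracy(weights, data):
    hits = sum((predict(weights, x) >= 0.5) == bool(y) for x, y in data)
    return hits / len(data)

def make_data(n, shift, rng):
    """Toy 'finding present/absent' data; features are [bias, value]."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        value = rng.gauss(2.0 * y - 1.0 + shift, 1.0)
        rows.append(([1.0, value], y))
    return rows

rng = random.Random(0)
synthetic = make_data(2000, shift=0.3, rng=rng)  # large, slightly off-distribution
real_train = make_data(50, shift=0.0, rng=rng)   # small real training set
real_test = make_data(500, shift=0.0, rng=rng)

# Stage 1: supervised pretraining on synthetic data only.
w_pre = train([0.0, 0.0], synthetic, epochs=3, lr=0.05)
# Stage 2: fine-tune the pretrained weights on real data with a smaller step.
w_two_stage = train(w_pre, real_train, epochs=3, lr=0.01)
# Baseline for comparison: naively mix both sources in a single run.
w_mixed = train([0.0, 0.0], synthetic + real_train, epochs=3, lr=0.05)

print("two-stage:", accuracy(w_two_stage, real_test))
print("naive mix:", accuracy(w_mixed, real_test))
```

The key design point is that stage 2 starts from `w_pre` instead of from scratch, so the small real dataset only has to correct what the synthetic data got slightly wrong. This toy setup is not expected to reproduce the paper's reported gains; it only shows the shape of the two pipelines.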

Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data

Stefania L. Moroianu / Christian Bluethgen / Pierre Chambon / Mehdi Cherti / Jean-Benoit Delbrouck / Magdalini Paschali / Brandon Price / Judy Gichoya / Jenia Jitsev / Curtis P. Langlotz / Akshay S. Chaudhari

Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% accuracy increase in the performance of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with the reduction of the underdiagnosis fairness gap by 19.3%. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open source our code, trained models, and synthetic dataset at https://github.com/StanfordMIMI/RoentGen-v2 .
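To make the "underdiagnosis fairness gap" more concrete, the sketch below shows one way such a gap could be measured. It assumes a definition the abstract does not spell out: the underdiagnosis rate of a subgroup is taken to be the fraction of its patients with a finding whom the model predicts as having none, and the gap is the difference between the highest and lowest subgroup rates. The function names and the tiny record list are hypothetical.

```python
from collections import defaultdict

def underdiagnosis_rates(records):
    """Per-group rate of patients with a finding who are predicted healthy.

    Each record is (group, has_finding, predicted_finding) with boolean flags.
    This definition is an assumption made for illustration.
    """
    missed = defaultdict(int)
    diseased = defaultdict(int)
    for group, has_finding, predicted_finding in records:
        if has_finding:
            diseased[group] += 1
            if not predicted_finding:
                missed[group] += 1
    return {g: missed[g] / diseased[g] for g in diseased}

def underdiagnosis_gap(records):
    # Gap between the worst-served and best-served subgroup.
    rates = underdiagnosis_rates(records)
    return max(rates.values()) - min(rates.values())

# Made-up predictions for two subgroups, not data from the paper.
records = [
    ("F", True, True), ("F", True, False), ("F", True, False), ("F", True, True),
    ("M", True, True), ("M", True, True), ("M", True, False), ("M", True, True),
    ("F", False, False), ("M", False, True),
]
rates = underdiagnosis_rates(records)
gap = underdiagnosis_gap(records)
print(rates, gap)
```

In this toy example, group "F" has 2 of 4 findings missed and group "M" has 1 of 4, so the gap is 0.25. Patients without a finding do not enter the rate, since a missed diagnosis is only possible when there is something to miss.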

cs / cs.CV / cs.AI