Published: 2025/12/17 6:18:40

Boosting the trustworthiness of AI clinicians! What's the GAPS framework? 👩‍⚕️✨ (Ultra-summary: a report card for medical AI has arrived!)

1. Title & Ultra-Summary

With the GAPS framework, an evaluation standard for medical AI, AI trustworthiness goes through the roof! 😎✨

2. Gyaru-Style Sparkle Points ✨

● It's basically a "report card" for medical AI! 🌟 Meaning its performance can finally be evaluated properly!
● AI gets checked from four angles (knowledge, completeness, robustness, safety) 🔎! An overall grade you can feel safe about!
● Clinicians (actual doctors) and the AI compared answers 🤝! 90% agreement, how amazing is that?! 😳


GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

Xiuyuan Chen / Tao Sun / Dexin Su / Ailing Yu / Junwei Liu / Zhe Chen / Gangzeng Jin / Xin Wang / Jingnan Liu / Hansong Xiao / Hualei Zhou / Dongjie Tao / Chunxiao Guo / Minghui Yang / Yuan Xia / Jing Zhao / Qianrui Fan / Yanyun Wang / Shuai Zhen / Kezhong Chen / Jun Wang / Zewen Sun / Heng Zhao / Tian Guan / Shaodong Wang / Geyun Chang / Jiaming Deng / Hongchengcheng Chen / Kexin Feng / Ruzhen Li / Jiayi Geng / Changtai Zhao / Jun Wang / Guihu Lin / Peihao Li / Liqi Liu / Peng Wei / Jian Wang / Jinjie Gu / Ping Wang / Fan Yang

Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are of high quality and align with clinician judgment (90% agreement, Cohen's Kappa 0.77). Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice. The benchmark dataset GAPS-NSCLC-preview and evaluation code are publicly available at https://huggingface.co/datasets/AQ-MedAI/GAPS-NSCLC-preview and https://github.com/AQ-MedAI/MedicalAiBenchEval.
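
As a rough, unofficial illustration of the clinician-agreement validation mentioned in the abstract, the Python sketch below computes percent agreement and Cohen's Kappa between two raters. The binary accept/reject labels are invented placeholders, not the paper's data, and are merely tuned so the toy output lands near the reported 90% agreement and Kappa 0.77.

# A minimal sketch of the validation step described in the abstract:
# comparing automated question-quality judgments against clinician
# judgments via percent agreement and Cohen's Kappa.
# The labels below are invented placeholders, NOT the paper's data.

from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's Kappa: chance-corrected agreement between two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)    # observed agreement
    ca, cb = Counter(a), Counter(b)  # marginal label counts per rater
    # Expected agreement if both raters labelled independently
    # at their own marginal rates.
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

if __name__ == "__main__":
    # Hypothetical accept (1) / reject (0) labels for 20 generated questions.
    clinician = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0]
    automated = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
    print(f"agreement: {percent_agreement(clinician, automated):.0%}")  # 90%
    print(f"kappa:     {cohens_kappa(clinician, automated):.2f}")       # 0.76

For hands-on use, the GAPS-NSCLC-preview dataset should be loadable with the Hugging Face datasets library, for example load_dataset("AQ-MedAI/GAPS-NSCLC-preview"), with splits and fields as defined in the dataset repository.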

cs / cs.CL