The Ultimate Gyaru AI Checks Up on LLMs with EPAG 💖

Published: 2026/1/7 6:15:21

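Super-short summary: They built a new way to evaluate how well LLMs handle "pre-consultation" (the patient interview before the actual visit)!
Gyaru-style sparkle points ✨
- The future of medicine is so hyped 🔥! AI might end up helping with patient interviews.
- The evaluation is grounded in diagnostic guidelines (basically the textbook for diagnosis)!
- It's open source (public for everyone), so it could keep getting better!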
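The detailed breakdown
- Background: LLMs (large language models) can do a lot in medicine, but they're not perfect yet, so we needed a way to evaluate them before using them safely 🤔.
- Method: They made an evaluation framework called EPAG that checks whether an LLM properly gathers the patient's info and asks questions in line with diagnostic guidelines! It's a two-part evaluation: directly, by comparing the collected HPI (History of Present Illness) against the guidelines, and indirectly, by how well the disease gets diagnosed 👀!
- Results: They tried all kinds of LLMs, like GPT-4 and Claude, and bigger didn't automatically mean better: small open-source models fine-tuned on a well-curated, task-specific dataset can outperform frontier LLMs at pre-consultation 😳. Collecting more HPI didn't necessarily improve diagnosis either.
- Why it matters (the seriously wow ♡ point): This could help LLMs do way more in medicine! It might also cut the risk of misdiagnosis (getting the diagnosis wrong), so a future where you can get treated with peace of mind might be coming ✨!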
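Ideas for real-life uses 💡
- Wouldn't an app where an AI doc asks you about your symptoms be super handy?
- If doctors could get diagnostic hints from AI, treatment could probably start sooner!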

Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

Jean Seo / Gibaeg Kim / Kihun Shin / Seungseop Lim / Hyunkyung Lee / Wooseok Han / Jongwon Lee / Eunho Yang

We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through comparison of the HPI (History of Present Illness) with diagnostic guidelines and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
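
As a rough illustration of the "direct" side of the evaluation described in the abstract, the sketch below scores how many guideline items a collected HPI covers. Everything in it is an assumption made for illustration: the function name `guideline_coverage`, the item lists, and the exact-match rule are hypothetical and are not taken from the EPAG repository, which may compare HPI and guidelines quite differently.

```python
def guideline_coverage(hpi_items, guideline_items):
    """Return the fraction of guideline items found in the collected HPI.

    Hypothetical metric for illustration only: matching is a simple
    case-insensitive exact match on item names.
    """
    collected = {item.strip().lower() for item in hpi_items}
    required = {item.strip().lower() for item in guideline_items}
    if not required:
        raise ValueError("guideline_items must not be empty")
    return len(collected & required) / len(required)


# Made-up example data, not from the EPAG dataset.
guideline = ["onset", "duration", "severity", "associated symptoms"]
hpi = ["Onset", "severity", "medication history"]

score = guideline_coverage(hpi, guideline)
print(f"coverage = {score:.2f}")
```

A coverage-style score like this rewards only the items the guideline asks for, so extra HPI that falls outside the guideline (here, "medication history") does not raise it. The indirect evaluation through disease diagnosis would need to be measured separately.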

cs / cs.CL / cs.AI
