QUICK REVIEW

[Paper Review] Sequential Diagnosis with Language Models

Harsha Nori, Mayank Daswani | ArXiv.org | Jun 27, 2025

Machine Learning in Healthcare | Citations: 9

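One-Sentence Summary

This paper proposes SDBench, an interactive sequential diagnosis benchmark built from NEJM CPC cases, and presents the MAI Diagnostic Orchestrator (MAI-DxO), which achieves higher accuracy and better cost efficiency than both human physicians and baseline LMs across multiple models.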

ABSTRACT

Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.

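Motivation and Objectives

Motivate evaluating diagnostic AI on realistic, iterative clinical reasoning rather than on static vignettes.
Convert 304 NEJM CPC cases into stepwise encounters mediated by a Gatekeeper and a Judge, and evaluate the quality of information gathering and decision-making under cost constraints.
Show that a model-agnostic orchestrator (MAI-DxO) can improve diagnostic accuracy and reduce cost across multiple language models.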

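Proposed Method

Build SDBench by converting 304 NEJM CPC cases into interactive sequential diagnosis encounters.
Use a Gatekeeper LM that discloses case findings only when explicitly queried, preventing information leakage and preserving realism.
Introduce a Judge agent that applies a physician-authored rubric, scores the diagnosis on a 1–5 Likert scale, and counts a score ≥ 4 as correct.
Establish a cost model that quantifies diagnostic spending by assigning a fixed cost per visit and CPT-based costs to tests.
Create MAI-DxO, an orchestration framework that simulates a panel of physicians with five roles (Hypothesis, Test-Chooser, Challenger, Stewardship, Checklist) and steers the encounter toward cost-conscious questions and tests.
Evaluate MAI-DxO and baseline LMs against human physicians on SDBench, using held-out test cases to assess generalization.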
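The Gatekeeper, cost model, and Judge threshold described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the class names, the `VISIT_COST` value, the policy of charging one visit fee per encounter, and the price table standing in for CPT-based costs are all hypothetical. Only the "reveal on explicit query" behavior and the "score ≥ 4 is correct" rule come from the review.

```python
from dataclasses import dataclass, field

VISIT_COST = 300.0       # hypothetical flat visit fee in USD (assumption)
CORRECT_THRESHOLD = 4    # from the review: a Judge score >= 4 counts as correct


@dataclass
class Gatekeeper:
    """Holds the full case file and reveals a finding only when queried."""
    findings: dict[str, str]       # hypothetical mapping: query -> finding
    test_prices: dict[str, float]  # hypothetical stand-in for CPT-based prices

    def reveal(self, query: str) -> str:
        # Nothing is volunteered: unknown queries get a neutral reply.
        return self.findings.get(query, "No information available.")


@dataclass
class Encounter:
    """Tracks what has been revealed so far and the running cost."""
    gatekeeper: Gatekeeper
    total_cost: float = VISIT_COST  # assumption: one visit fee per encounter
    revealed: dict[str, str] = field(default_factory=dict)

    def ask(self, question: str) -> str:
        """History or exam question, assumed to be covered by the visit fee."""
        answer = self.gatekeeper.reveal(question)
        self.revealed[question] = answer
        return answer

    def order_test(self, test: str) -> str:
        """Order a test: add its price to the total, then reveal the result."""
        self.total_cost += self.gatekeeper.test_prices.get(test, 0.0)
        return self.ask(test)


def is_correct(judge_score: int) -> bool:
    """Map a 1-5 Likert score from the Judge to a correct/incorrect label."""
    if not 1 <= judge_score <= 5:
        raise ValueError("Judge score must be between 1 and 5")
    return judge_score >= CORRECT_THRESHOLD
```

In this sketch a diagnostic agent would call `ask` and `order_test` repeatedly, read `total_cost` when it commits to a diagnosis, and pass the Judge's score to `is_correct`. Keeping the cost inside the encounter object makes accuracy and spending available together for each case.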

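Experimental Results

Research Questions

RQ1: Can AI agents perform sequential diagnosis under information-gathering conditions and cost constraints that approximate clinical practice?
RQ2: Does an orchestrated multi-physician panel improve diagnostic accuracy and reduce cost compared with a single model or human physicians?
RQ3: How well does the sequential-diagnosis performance of off-the-shelf language models generalize across model families?
RQ4: How does incorporating cost awareness and an adversarial challenger role affect diagnostic quality?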

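Key Findings

MAI-DxO paired with OpenAI o3 achieves 80% diagnostic accuracy, four times the 20% average of generalist physicians.
MAI-DxO reduces diagnostic costs by 20% compared with physicians and by 70% compared with off-the-shelf o3.
When configured for maximum accuracy, MAI-DxO reaches 85.5% accuracy.
The gains from MAI-DxO generalize across multiple model families, including OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama.
Off-the-shelf o3 reaches 78.6% accuracy at $7,850 per case; physicians average 19.9% accuracy at $2,963 per case.
The no-budget MAI-DxO configuration reaches 81.9% accuracy at $4,735 per case, cheaper than baseline o3, and the ensemble variant reaches 85.5% accuracy at $7,184 per case.
MAI-DxO consistently improves accuracy across capable models and brings marked improvements in cost awareness even to weaker models.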
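The accuracy and cost figures above can be cross-checked with a few lines of Python. The sketch below uses only numbers reported in this review. The `dominates` and `pareto_frontier` helpers are illustrative names, not part of the paper, and the "implied cost" values are simple arithmetic on the reported percentages rather than figures taken from the source.

```python
# (accuracy in %, cost per case in USD) as reported in the findings above
results = {
    "Physicians": (19.9, 2963),
    "Off-the-shelf o3": (78.6, 7850),
    "MAI-DxO (no budget)": (81.9, 4735),
    "MAI-DxO (ensemble)": (85.5, 7184),
}


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    (acc_a, cost_a), (acc_b, cost_b) = a, b
    no_worse = acc_a >= acc_b and cost_a <= cost_b
    strictly_better = acc_a > acc_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_frontier(points):
    """Names of configurations that no other configuration dominates."""
    return [
        name
        for name, point in points.items()
        if not any(
            dominates(other_point, point)
            for other_name, other_point in points.items()
            if other_name != name
        )
    ]


# Headline ratio: 80% for MAI-DxO with o3 versus 20% for generalist physicians.
accuracy_ratio = 80 / 20

# Cost implied by "20% cheaper than physicians" and "70% cheaper than o3".
implied_vs_physicians = round(2963 * (1 - 0.20), 2)
implied_vs_o3 = round(7850 * (1 - 0.70), 2)

print(pareto_frontier(results))
print(accuracy_ratio, implied_vs_physicians, implied_vs_o3)
```

On these numbers both MAI-DxO configurations dominate off-the-shelf o3, since each is more accurate and cheaper. Physicians stay on the frontier only because they are the cheapest option. The two implied costs land within about $15 of each other, so the 20% and 70% reductions are mutually consistent.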

This review was generated by AI and checked by a human editor.