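[Paper Summary] Sequential Diagnosis with Language Models
This paper introduces SDBench, an interactive sequential diagnosis benchmark built from NEJM CPC cases, and the MAI Diagnostic Orchestrator (MAI-DxO), which simulates a panel of physician roles and outperforms both human physicians and baseline language models on accuracy and cost efficiency across multiple underlying models.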
Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.
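Motivation and Objectives
- Move the evaluation of diagnostic AI toward realistic, iterative clinical reasoning rather than static case vignettes.
- Convert 304 NEJM CPC cases into stepwise encounters with a Gatekeeper and a Judge, so that information gathering and decision quality can be assessed under cost constraints.
- Show that a model-agnostic orchestrator (MAI-DxO) can raise diagnostic accuracy and lower cost across multiple language models.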
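Proposed Method
- Build SDBench by converting 304 NEJM CPC cases into interactive sequential diagnosis scenarios.
- Use a Gatekeeper LM that reveals case findings only when queried, which prevents information leakage and preserves realism.
- Introduce a Judge agent that scores diagnostic accuracy on a 1–5 Likert scale against a physician-written rubric, counting a diagnosis as correct when the score is ≥4.
- Model cost by assigning a fixed cost per physician visit and CPT-based costs to tests, so that diagnostic spending can be quantified.
- Create MAI-DxO, an orchestration framework that simulates a panel of physicians with five roles (Hypothesis, Test-Chooser, Challenger, Stewardship, Checklist) and steers questioning and testing in a cost-aware manner.
- Evaluate MAI-DxO and baseline LMs against human physicians on SDBench, using held-out test cases to assess generalization.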
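The bookkeeping behind these steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names `Gatekeeper` and `Encounter`, the helper `is_correct`, and every price in `VISIT_COST` and `TEST_PRICES` are hypothetical placeholders. The sketch also assumes that questions asked within a visit are not billed separately, which the summary does not specify. Only three things come from the text: findings are revealed on request, cost is a fixed visit fee plus per-test prices, and a Judge score of 4 or higher counts as correct.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical placeholder prices. The summary says visits have a fixed cost
# and tests are priced from CPT codes, but it quotes no price list.
VISIT_COST = 300.0
TEST_PRICES: Dict[str, float] = {"cbc": 50.0, "chest_ct": 1200.0}


@dataclass
class Gatekeeper:
    """Holds the full case file and reveals a finding only when it is requested."""
    findings: Dict[str, str]

    def answer(self, query: str) -> str:
        # Anything not explicitly requested (or not in the case) stays hidden.
        return self.findings.get(query, "not available")


@dataclass
class Encounter:
    """Tracks what one diagnostician has asked for and what it has cost so far."""
    gatekeeper: Gatekeeper
    visits: int = 1  # assumption: the initial visit is billed once
    tests: List[str] = field(default_factory=list)

    def ask(self, question: str) -> str:
        # Assumption: questions inside a visit carry no extra charge.
        return self.gatekeeper.answer(question)

    def order_test(self, name: str) -> str:
        if name not in TEST_PRICES:
            raise KeyError(f"no price listed for test {name!r}")
        self.tests.append(name)  # a test is charged even if it is uninformative
        return self.gatekeeper.answer(name)

    def total_cost(self) -> float:
        return self.visits * VISIT_COST + sum(TEST_PRICES[t] for t in self.tests)


def is_correct(judge_score: int) -> bool:
    """Apply the stated rule: a Likert score of 4 or 5 counts as a correct diagnosis."""
    if not 1 <= judge_score <= 5:
        raise ValueError("Judge scores are on a 1-5 Likert scale")
    return judge_score >= 4


if __name__ == "__main__":
    case = Gatekeeper({"cbc": "WBC 14.2", "chest_ct": "right upper lobe mass"})
    enc = Encounter(case)
    print(enc.order_test("cbc"))
    print(enc.order_test("chest_ct"))
    print(enc.total_cost(), is_correct(4))
```

Keeping cost accounting in the encounter object, separate from the Gatekeeper, means the same case file can be replayed against different diagnosticians while each accumulates its own bill.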
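Experimental Results
Research Questions
- RQ1: Can AI agents perform sequential diagnosis under realistic information gathering and cost constraints that approximate clinical practice?
- RQ2: Under multi-physician collaborative orchestration, does diagnostic accuracy exceed that of a single model and of human physicians while cost falls?
- RQ3: How well do off-the-shelf language models from different model families generalize to the sequential diagnosis task?
- RQ4: How does introducing cost awareness and adversarial (challenger) roles affect diagnostic quality?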
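Key Findings
- Paired with OpenAI o3, MAI-DxO achieves 80% diagnostic accuracy, four times the 20% average of generalist physicians.
- MAI-DxO lowers diagnostic cost by 20% relative to physicians and by 70% relative to off-the-shelf o3.
- Configured for maximum accuracy, MAI-DxO reaches 85.5% accuracy.
- The gains from MAI-DxO generalize across model families including OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama.
- Off-the-shelf o3 reaches 78.6% accuracy at a cost of $7,850 per case, whereas physicians average 19.9% accuracy at $2,963 per case.
- The MAI-DxO configuration with no budget constraint reaches 81.9% accuracy at $4,735, below the cost of baseline o3; the ensemble variant reaches 85.5% accuracy at $7,184.
- MAI-DxO consistently improves accuracy across the available models, and it also delivers marked cost-aware improvements on weaker models.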
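The percentages and dollar figures in these findings can be cross-checked with simple arithmetic. The sketch below assumes that the 20% and 70% reductions are measured against the per-case costs quoted for physicians ($2,963) and off-the-shelf o3 ($7,850). The summary does not state the resulting dollar amount, so the implied values are derived here rather than reported, and the helper `relative_reduction` is a name introduced only for this illustration.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional cost saving of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Per-case costs quoted in the findings (US dollars).
physician_cost = 2963.0
o3_cost = 7850.0
no_budget_cost = 4735.0
ensemble_cost = 7184.0

# Per-case cost implied by each headline percentage (derived, not reported).
implied_from_physicians = physician_cost * (1 - 0.20)
implied_from_o3 = o3_cost * (1 - 0.70)

# Savings of the two named MAI-DxO configurations relative to baseline o3.
no_budget_saving = relative_reduction(o3_cost, no_budget_cost)
ensemble_saving = relative_reduction(o3_cost, ensemble_cost)

if __name__ == "__main__":
    print(f"implied cost from physician baseline: {implied_from_physicians:.2f}")
    print(f"implied cost from o3 baseline:        {implied_from_o3:.2f}")
    print(f"no-budget saving vs o3: {no_budget_saving:.1%}")
    print(f"ensemble saving vs o3:  {ensemble_saving:.1%}")
```

Under these assumptions the two implied costs land within about $16 of each other (roughly $2,370 and $2,355), which suggests the rounded 20% and 70% figures describe the same configuration. The no-budget and ensemble configurations save roughly 40% and 8% relative to baseline o3, so the 70% figure cannot refer to either of them.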
This summary was generated by AI and reviewed by a human editor.