QUICK REVIEW

[Paper Review] Sequential Diagnosis with Language Models

Harsha Nori, Mayank Daswani | arXiv.org | 2025. 06. 27.

Machine Learning in Healthcare | Citations: 9

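One-Line Summary

This paper presents SDBench, an interactive sequential-diagnosis benchmark built from NEJM CPC cases, and the MAI Diagnostic Orchestrator (MAI-DxO), which simulates a panel of physician roles and achieves higher accuracy and better cost efficiency than both human physicians and baseline LMs across multiple models.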

ABSTRACT

Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.

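Motivation and Objectives

Move the evaluation of diagnostic AI from static vignettes to dynamic settings that reflect realistic, iterative clinical reasoning.
Convert 304 NEJM CPC cases into stepwise encounters mediated by a Gatekeeper and a Judge, so that the quality of information gathering and decision-making can be assessed under cost constraints.
Demonstrate that a model-agnostic orchestrator (MAI-DxO) can improve diagnostic accuracy and reduce cost across many language models.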

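Proposed Method

Develop SDBench by converting 304 NEJM CPC cases into interactive sequential-diagnosis encounters.
Use a Gatekeeper LM that discloses case findings only when they are explicitly queried, preventing information leakage and preserving realism.
Introduce a Judge agent with a physician-written rubric that scores each diagnosis on a 1–5 Likert scale, with a score ≥ 4 counted as correct.
Establish a cost model that quantifies diagnostic cost by assigning a fixed cost per visit and CPT-based costs to tests.
Build MAI-DxO, an orchestration framework for a multi-physician panel of five roles (Hypothesis, Test-Chooser, Challenger, Stewardship, Checklist) that drives questioning and test ordering in a cost-aware manner.
Evaluate MAI-DxO and baseline LMs against human physicians on SDBench, using held-out test cases to assess generalizability.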
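
The Gatekeeper, cost model, and Judge threshold described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the exact-match lookup standing in for the Gatekeeper LM, the "Not available." fallback, and every price are hypothetical placeholders chosen for illustration. Only the rule that a Judge score of 4 or higher counts as correct comes from the review.

```python
from __future__ import annotations

from dataclasses import dataclass

# All prices below are illustrative placeholders, not SDBench's actual figures.
VISIT_COST = 300.0
TEST_PRICES = {"cbc": 50.0, "chest ct": 1200.0}
DEFAULT_TEST_PRICE = 100.0
CORRECT_THRESHOLD = 4  # from the review: a Judge score >= 4 counts as correct


@dataclass
class Gatekeeper:
    """Holds the case findings and reveals one only when it is asked for."""
    findings: dict[str, str]

    def answer(self, request: str) -> str:
        # Exact-match lookup stands in for the Gatekeeper LM (assumption).
        return self.findings.get(request.strip().lower(), "Not available.")


@dataclass
class Encounter:
    """Tracks the running cost of one diagnostic encounter."""
    gatekeeper: Gatekeeper
    cost: float = VISIT_COST  # one visit fee is charged up front in this sketch

    def ask(self, question: str) -> str:
        # Questions are covered by the visit fee here, so they add no cost.
        return self.gatekeeper.answer(question)

    def order_test(self, test: str) -> str:
        # Each ordered test adds its price to the running total.
        self.cost += TEST_PRICES.get(test.strip().lower(), DEFAULT_TEST_PRICE)
        return self.gatekeeper.answer(test)


def is_correct(judge_score: int) -> bool:
    """Apply the review's correctness rule to a 1-5 Likert score."""
    if not 1 <= judge_score <= 5:
        raise ValueError("Judge scores are on a 1-5 Likert scale")
    return judge_score >= CORRECT_THRESHOLD


gatekeeper = Gatekeeper({
    "fever history": "Intermittent fever for two weeks.",
    "cbc": "Elevated white blood cell count.",
})
encounter = Encounter(gatekeeper)
history = encounter.ask("Fever history")   # revealed because it was requested
hidden = encounter.ask("travel history")   # nothing on file for this request
cbc = encounter.order_test("CBC")          # adds the test price to the cost
print(encounter.cost)
```

In this sketch the diagnostic agent never sees the `findings` dictionary directly, which mirrors the idea that information is released only on request while every test increases the total cost of the encounter.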

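Experimental Results

Research Questions

RQ1: Can AI agents perform sequential diagnosis under information-gathering and cost constraints that resemble clinical practice?
RQ2: Does an orchestrated panel of multiple physician roles improve diagnostic accuracy and cost relative to a single model or human physicians?
RQ3: How well do off-the-shelf language models from different model families generalize on the sequential diagnosis task?
RQ4: How does integrating cost awareness and the Challenger role affect diagnostic quality?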

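Key Results

Paired with OpenAI o3, MAI-DxO reaches 80% diagnostic accuracy, four times the 20% average of generalist physicians.
MAI-DxO reduces diagnostic cost by 20% relative to physicians and by 70% relative to off-the-shelf o3.
In its maximum-accuracy configuration, MAI-DxO reaches 85.5% accuracy.
MAI-DxO's gains generalize across model families, including OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama.
Off-the-shelf o3 reaches 78.6% accuracy at $7,850 per case, whereas physicians average 19.9% accuracy at $2,963 per case.
The MAI-DxO (no budget) configuration reaches 81.9% accuracy at $4,735 per case, below the cost of baseline o3, and the ensemble variant reaches 85.5% accuracy at $7,184 per case.
MAI-DxO consistently improves accuracy across the models evaluated and also delivers cost-aware improvements for weaker models.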
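
As a quick arithmetic check on the figures quoted above, the short script below computes the relative cost reduction and the accuracy gain of the two MAI-DxO configurations against off-the-shelf o3. It uses only numbers listed in this review; the dictionary keys and function name are labels chosen for this sketch. The 70% and 20% cost reductions quoted earlier are not reproduced by these two configurations and presumably refer to a setup whose per-case cost is not listed here.

```python
# Accuracy (%) and cost per case (USD) as quoted in the key results.
RESULTS = {
    "physicians": (19.9, 2963),
    "o3 (off-the-shelf)": (78.6, 7850),
    "MAI-DxO (no budget)": (81.9, 4735),
    "MAI-DxO (ensemble)": (85.5, 7184),
}


def cost_reduction_pct(baseline_cost: float, new_cost: float) -> float:
    """Relative cost reduction versus a baseline, in percent."""
    return 100.0 * (baseline_cost - new_cost) / baseline_cost


baseline_acc, baseline_cost = RESULTS["o3 (off-the-shelf)"]
summary = {}
for name in ("MAI-DxO (no budget)", "MAI-DxO (ensemble)"):
    acc, cost = RESULTS[name]
    summary[name] = (
        round(cost_reduction_pct(baseline_cost, cost), 1),  # cost reduction, %
        round(acc - baseline_acc, 1),                        # accuracy gain, points
    )
    print(f"{name}: cost -{summary[name][0]}%, accuracy +{summary[name][1]} points vs o3")
```

Under these figures, the no-budget configuration costs roughly 40% less than off-the-shelf o3 while gaining 3.3 accuracy points, and the ensemble variant trades most of that saving for a 6.9-point gain.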

This review was generated by AI and reviewed by a human editor.