QUICK REVIEW

[Paper Review] MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

Kevin Wu, Eric Q. Wu | ArXiv.org | May 16, 2025

Biomedical Text Mining and Ontologies | Citations: 4

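One-line summary

Using an open-access dataset and benchmark, the paper evaluates how closely LLM diagnostic reasoning matches clinician-authored reasoning, and shows that fine-tuning on reasoning traces improves both accuracy and reasoning recall.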

ABSTRACT

Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.

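Motivation and objectives

Provide an open benchmark for evaluating how well LLM diagnostic reasoning aligns with clinician-authored reasoning.
Build a large, high-quality dataset of real diagnostic cases, including clinician-provided reasoning, from PubMed Central case reports.
Use the MedCaseReasoning benchmark to evaluate the diagnostic accuracy and reasoning recall of current state-of-the-art and open-source LLMs.
Show that supervised fine-tuning on MedCaseReasoning reasoning traces improves both diagnostic accuracy and reasoning recall, and generalizes to NEJM CPC cases.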

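Proposed method

Assemble the 14,489-case MedCaseReasoning dataset from clinician-authored case reports that include a differential diagnosis and a final diagnosis.
Convert the case reports into QA format, applying multi-stage filtering and clinician validation to ensure quality and faithfulness.
Define Reasoning Recall as a metric that quantifies the overlap between clinician-provided reasons and the model's reasoning trace.
Evaluate diagnostic accuracy with 10-shot prompting, using an LLM as the judge (gpt-4o-mini).
Compute reasoning recall by comparing the model's reasoning trace against the ground-truth clinician reasons.
Fine-tune open-source models (Qwen-2.5-7B-Instruct, LLaMA-3.1-8B-Instruct, MedReason-8B) on stitched reasoning traces with 3 epochs of SFT.
Compare performance by evaluating generalization on the MedCaseReasoning test set and on held-out NEJM CPC cases.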
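
The Reasoning Recall definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the word-matching `keyword_judge` are hypothetical stand-ins for the LLM judge the paper uses, chosen only so the example runs without an API call.

```python
from typing import Callable, Sequence


def keyword_judge(reason: str, trace: str) -> bool:
    """Toy stand-in for an LLM judge (an assumption, not the paper's judge).

    A reason counts as covered when every word of the reason appears
    somewhere in the trace, compared case-insensitively.
    """
    trace_words = set(trace.lower().split())
    return all(word in trace_words for word in reason.lower().split())


def reasoning_recall(
    clinician_reasons: Sequence[str],
    model_trace: str,
    judge: Callable[[str, str], bool] = keyword_judge,
) -> float:
    """Fraction of clinician reasoning statements the judge finds in the trace."""
    if not clinician_reasons:
        raise ValueError("need at least one clinician reasoning statement")
    covered = sum(1 for reason in clinician_reasons if judge(reason, model_trace))
    return covered / len(clinician_reasons)


# Hypothetical example case: two of the three clinician reasons are covered.
reasons = ["elevated troponin", "st elevation", "normal chest radiograph"]
trace = "the patient has elevated troponin and st elevation on ecg"
print(reasoning_recall(reasons, trace))
```

In practice the `judge` argument would wrap a call to a language model that decides whether a clinician statement is expressed in the trace; passing it as a parameter keeps the recall arithmetic separate from that choice.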

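Experimental results

Research questions

RQ1: When diagnosing real-world clinical cases, how accurately can reasoning-capable LLMs diagnose, and how much of the clinician-provided reasoning can they reproduce?
RQ2: Does fine-tuning LLMs on MedCaseReasoning reasoning traces improve diagnostic accuracy and reasoning recall?
RQ3: How does performance on MedCaseReasoning correlate with performance on NEJM CPC diagnostic cases?
RQ4: What effect does supervised fine-tuning have on the ability of open-source medical LLMs to recall clinical reasoning?
RQ5: How valid is the Reasoning Recall metric as a proxy for diagnostic ability?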

Key findings

Model	Reasoning Recall	1-shot Acc.	5-shot Acc.	10-shot Acc.
OpenAI o3	N/A	0.470 (0.440-0.500)	0.609 (0.579-0.639)	0.645 (0.618-0.675)
DeepSeek R1	0.642 (0.616-0.667)	0.320 (0.291-0.349)	0.447 (0.417-0.478)	0.480 (0.450-0.510)
QwQ-32B	0.590 (0.560-0.619)	0.272 (0.245-0.302)	0.371 (0.341-0.400)	0.398 (0.368-0.428)
MedReason-8B	0.407 (0.383-0.431)	0.248 (0.224-0.275)	0.331 (0.303-0.363)	0.382 (0.353-0.412)
LLaMA-3.1-8B-Instruct	0.451 (0.428-0.475)	0.161 (0.138-0.184)	0.281 (0.252-0.311)	0.332 (0.304-0.360)
m1-7b-23k	0.495 (0.440-0.551)	0.155 (0.133-0.177)	0.238 (0.211-0.264)	0.291 (0.262-0.321)
Qwen-2.5-7B	0.324 (0.301-0.347)	0.174 (0.152-0.197)	0.252 (0.223-0.279)	0.287 (0.259-0.316)

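Even the top models show limited diagnostic reasoning; OpenAI o3 reaches 64.5% 10-shot accuracy on MedCaseReasoning (reported as 65% in the paper's text), and DeepSeek R1 reaches 48.0%.
On the MedCaseReasoning test set, DeepSeek R1's Reasoning Recall is about 64.2%, and many models recall only part of the clinicians' reasoning.
Fine-tuning on MedCaseReasoning traces yields notable gains: MedReason-8B (SFT) reaches 50.1% 10-shot accuracy, up from a baseline of 38.2%; Qwen-2.5-7B (SFT) reaches 42.5% 10-shot accuracy, up from a baseline of 28.5%.
Fine-tuning also improves NEJM CPC performance, indicating generalization beyond the MedCaseReasoning dataset.
Diagnostic performance on MedCaseReasoning and on NEJM CPC is strongly correlated (Figure 2).
Reasoning Recall correlates strongly with overall model performance and with reasoning trace length (Pearson r = 0.710, p = 0.0485 and r = 0.790, p = 0.0196, respectively).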
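
To make the fine-tuning gains concrete, the sketch below computes the relative gain in 10-shot accuracy for the two models whose baseline and SFT figures are quoted above. The helper name `relative_gain` is an assumption for illustration; the abstract's 29% figure is an average over the fine-tuned models, so the per-model values here are not expected to match it exactly.

```python
def relative_gain(baseline: float, finetuned: float) -> float:
    """Relative improvement of the fine-tuned score over the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (finetuned - baseline) / baseline


# (baseline, SFT) 10-shot accuracies as quoted in the findings above.
results = {
    "MedReason-8B": (0.382, 0.501),
    "Qwen-2.5-7B": (0.285, 0.425),
}

for model, (base, sft) in results.items():
    print(f"{model}: {relative_gain(base, sft):.1%} relative gain")
```

Relative gain divides the absolute improvement by the baseline, so a weaker starting model can show a larger relative gain from a similar absolute increase.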


This review was generated by AI and checked by a human editor.