QUICK REVIEW

[Paper Review] EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

Qiuhui Chen, Xiaolei Yao|arXiv (Cornell University)|Feb 22, 2026

Multimodal Machine Learning Applications | Citations: 0

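One-Line Summary

EMAD is a vision-language framework that jointly reasons over 3D MRI and clinical data to generate structured, evidence-grounded AD diagnostic reports, using explicit sentence-evidence-anatomy grounding and executable-rule RL fine-tuning.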

ABSTRACT

Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.

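Motivation and Objectives

Build a transparent AD diagnosis system that aligns with clinical guidelines and provides explicit evidence for every claim.
Propose a multimodal vision-language model that grounds each sentence in clinical evidence and in localized brain anatomy.
Reduce annotation cost through label-efficient grounding transfer (GTX-Distill) and enforce clinical consistency through executable-rule RL fine-tuning (Executable-Rule GRPO).
Achieve calibrated diagnosis and anatomically faithful reporting across a large cohort.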

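Proposed Method

A multimodal encoder with bidirectional cross-attention fusion over 3D sMRI and structured clinical data produces a unified representation.
Sentence-Evidence-Anatomy (SEA) grounding: each sentence is grounded in clinical evidence, which is then grounded in anatomical masks on the 3D MRI.
GTX-Distill: transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports.
Executable-Rule GRPO: RL with verifiable rewards that enforce structured output, NIA-AA consistency, and reasoning-diagnosis entailment.
Three-stage training: pre-training (ITC and reconstruction), supervised fine-tuning with GTX-Distill and SEA, and reinforcement fine-tuning with GRPO.
Grounding uses a multi-positive InfoNCE loss for sentence-evidence alignment and evidence-conditioned 3D segmentation for anatomy grounding.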
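
The two grounding components named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the review does not give the exact loss, so this uses one common multi-positive InfoNCE form (the negative log of the softmax mass assigned to all positive evidence phrases), and it shows the Dice overlap that the findings use to score anatomy grounding. The function names, the temperature of 0.07, and the array layouts are assumptions made for illustration.

```python
import numpy as np


def multi_positive_info_nce(sent_emb, evid_emb, pos_mask, temperature=0.07):
    """One common multi-positive InfoNCE form (an assumption, not the paper's code).

    sent_emb: (S, D) sentence embeddings
    evid_emb: (E, D) evidence-phrase embeddings
    pos_mask: (S, E) boolean, True where evidence e supports sentence s
    """
    pos_mask = np.asarray(pos_mask, dtype=bool)
    has_pos = pos_mask.any(axis=1)
    if not has_pos.any():
        raise ValueError("at least one sentence needs a positive evidence phrase")

    # L2-normalise so the dot product is a cosine similarity.
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    e = evid_emb / np.linalg.norm(evid_emb, axis=1, keepdims=True)
    logits = s @ e.T / temperature  # (S, E)

    # Numerically stable softmax over all evidence candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Probability mass that each sentence places on its positive phrases.
    pos_mass = (probs * pos_mask).sum(axis=1)[has_pos]
    return float(-np.log(pos_mass).mean())


def dice_score(pred_mask, true_mask, eps=1e-8):
    """Dice overlap between two binary 3D masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    inter = np.logical_and(pred, true).sum()
    return float((2.0 * inter + eps) / (pred.sum() + true.sum() + eps))
```

Sentences without any positive evidence phrase are skipped in the loss, which is a design choice of this sketch rather than something the review specifies.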

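Experimental Results

Research Questions

RQ1: Can a multimodal model generate AD diagnoses that are explicitly grounded in both clinical evidence and anatomical localization?
RQ2: Can transferring grounding knowledge with GTX-Distill reduce annotation cost while preserving grounding quality?
RQ3: Does executable-rule reinforcement learning improve clinical fidelity and adherence to diagnostic guidelines in AD reports?
RQ4: How well does EMAD perform on CN/MCI/AD staging and on transparent, anatomically faithful report generation on AD-MultiSense?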

Key Findings

Task - Method	BLEU	METEOR	ROUGE	BERT	ACC (%)	AUC (%)	SEN (%)	SPE (%)
CN vs CI - LLaVA-1.5-7B ∗	0.0831	0.2417	0.2795	0.8012	74.23	70.58	62.14	82.36
CN vs CI - LLaVA-Med ∗	0.1024	0.2635	0.3042	0.8137	76.41	73.27	64.89	84.72
CN vs CI - Med-PaLM-M ∗	0.1189	0.2826	0.3314	0.8293	79.12	76.84	67.53	86.19
CN vs CI - M3d-LaMed ∗	0.1375	0.2982	0.3598	0.8341	82.37	79.65	70.94	87.56
CN vs CI - LLaVA-1.5-7B	0.2973	0.4764	0.5987	0.8485	86.42	83.19	80.37	88.54
CN vs CI - LLaVA-Med	0.3186	0.4981	0.6179	0.8592	88.57	85.03	82.16	90.28
CN vs CI - Med-PaLM-M	0.3394	0.5173	0.6371	0.8726	90.13	87.42	84.95	92.07
CN vs CI - M3d-LaMed	0.3627	0.5419	0.6594	0.8748	91.28	89.16	86.72	93.14
CN vs CI - EMAD (ours)	0.5422	0.6790	0.7781	0.9130	93.33	91.83	88.67	95.00
CN vs MCI - LLaVA-1.5-7B ∗	0.0715	0.2283	0.2594	0.7886	71.18	68.47	63.52	77.39
CN vs MCI - LLaVA-Med ∗	0.0897	0.2472	0.2816	0.7991	73.42	70.59	66.84	79.21
CN vs MCI - Med-PaLM-M ∗	0.1123	0.2698	0.3097	0.8184	76.35	73.48	68.92	82.17
CN vs MCI - M3d-LaMed ∗	0.1294	0.2875	0.3391	0.8217	78.64	76.23	71.37	84.53
CN vs MCI - LLaVA-1.5-7B	0.2826	0.4627	0.5789	0.8421	84.27	82.14	79.63	87.18
CN vs MCI - LLaVA-Med	0.3018	0.4815	0.6012	0.8534	86.39	84.27	81.45	89.32
CN vs MCI - Med-PaLM-M	0.3241	0.5036	0.6228	0.8649	88.21	86.45	83.72	91.08
CN vs MCI - M3d-LaMed	0.3437	0.5219	0.6413	0.8685	89.47	88.06	85.29	92.36
CN vs MCI - EMAD (ours)	0.5343	0.6421	0.7912	0.9130	92.82	90.09	88.60	93.50
Three-way CN/MCI/AD - EMAD (ours)	-	-	-	-	89.4	87.8	90.6	86.3
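
For readers unfamiliar with the classification columns, the sketch below shows how ACC, SEN, and SPE are conventionally computed from binary predictions. It is an illustrative helper, not code from the paper: the function name is hypothetical, and treating the impaired class (CI or MCI) as the positive label 1 is an assumption. AUC is omitted because it requires continuous scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity in percent.

    Labels are 0 (negative, e.g. CN) or 1 (positive, e.g. CI or MCI);
    that label convention is an assumption of this sketch.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "ACC": 100.0 * (tp + tn) / len(pairs),
        "SEN": 100.0 * tp / (tp + fn),  # true-positive rate
        "SPE": 100.0 * tn / (tn + fp),  # true-negative rate
    }
```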

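EMAD achieves state-of-the-art diagnostic performance on CN vs CI and CN vs MCI, outperforming strong medical LLM baselines on both report-quality metrics and accuracy/AUC.
SEA grounding and GTX-Distill substantially improve sentence-evidence and evidence-anatomy alignment (R@1 up to 0.65, MAP up to 0.76).
Evidence-conditioned 3D segmentation yields higher Dice scores for hippocampus and medial temporal lobe grounding than image-only segmentation.
GTX-Distill enables label-efficient grounding transfer, retaining about 95% of the teacher model's performance with 25% of the grounding labels.
Executable-Rule GRPO improves structured-format validity, NIA-AA consistency, and reasoning-diagnosis entailment while maintaining diagnostic accuracy.
EMAD explicitly links the claims in its generated reports to measurements and brain structures.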
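
To make the idea of verifiable rewards concrete, here is a minimal sketch of a rule-based reward combined with the group-relative advantage normalisation used in GRPO. Everything specific is an assumption: the JSON report schema, the three toy checks, and the equal weighting are hypothetical stand-ins, and the checks do not encode the actual NIA-AA criteria or the paper's rules. Only the label set CN/MCI/AD comes from the review.

```python
import json
import re
import statistics

REQUIRED_KEYS = ("findings", "reasoning", "diagnosis")  # hypothetical schema
VALID_LABELS = {"CN", "MCI", "AD"}  # staging labels named in the review


def rule_reward(report_text):
    """Toy executable-rule reward in [0, 3]; the rules are placeholders."""
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(report, dict) or any(k not in report for k in REQUIRED_KEYS):
        return 0.0
    reward = 1.0  # rule 1: output parses and has the expected fields

    diagnosis = report["diagnosis"]
    if isinstance(diagnosis, str) and diagnosis in VALID_LABELS:
        reward += 1.0  # rule 2: diagnosis is an allowed label
        # Rule 3 (crude coherence proxy): the reasoning names the same label.
        words = re.findall(r"[A-Za-z]+", str(report["reasoning"]))
        if diagnosis in words:
            reward += 1.0
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardise rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In use, several candidate reports would be sampled for the same case, scored with `rule_reward`, and the resulting advantages would weight the policy update. Because every check is a deterministic program, the reward can be audited without a learned reward model.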

This review was generated by AI and checked by a human editor.