QUICK REVIEW

[Paper Review] Making medical vision-language models think causally across modalities with retrieval-augmented cross-modal reasoning

Weiqin Yang, Haowen Xue|arXiv (Cornell University)|Jan 26, 2026

Multimodal Machine Learning Applications | Citations: 0

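One-line summary

The paper proposes MCRAG, a retrieval-augmented generation framework that builds and uses multimodal causal graphs to ground medical vision-language reasoning, improving factuality and robustness in radiology VQA and report generation.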

ABSTRACT

Medical vision-language models (VLMs) achieve strong performance in diagnostic reporting and image-text alignment, yet their underlying reasoning mechanisms remain fundamentally correlational, exhibiting reliance on superficial statistical associations that fail to capture the causal pathophysiological mechanisms central to clinical decision-making. This limitation makes them fragile, prone to hallucinations, and sensitive to dataset biases. Retrieval-augmented generation (RAG) offers a partial remedy by grounding predictions in external knowledge. However, conventional RAG depends on semantic similarity, introducing new spurious correlations. We propose Multimodal Causal Retrieval-Augmented Generation, a framework that integrates causal inference principles with multimodal retrieval. It retrieves clinically relevant exemplars and causal graphs from external sources, conditioning model reasoning on counterfactual and interventional evidence rather than correlations alone. Applied to radiology report generation, diagnosis prediction, and visual question answering, it improves factual accuracy, robustness to distribution shifts, and interpretability. Our results highlight causal retrieval as a scalable path toward medical VLMs that think beyond pattern matching, enabling trustworthy multimodal reasoning in high-stakes clinical settings.

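Motivation and objectives

Address the factuality problems and spurious correlations of medical vision-language models (Med-LVLMs).
Introduce a multimodal structural causal model (SCM) that guides cross-modal retrieval.
Develop a retrieval-augmented generation framework that grounds outputs in causal evidence from the medical literature.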

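Proposed method

Build a multimodal causal alignment graph from paired image-report data, using VLM-assisted discovery followed by manual refinement.
Define an SCM-based retrieval score: Score(R_k) = (1 - α) log p_G(V_D, V_F | V_I) + α sim(V_I, R_k).
Run domain-aware retrieval to select the top-K reports that are causally consistent along graph paths.
Apply causal filtering and reranking within the RAG setup to balance coverage and precision.
Fine-tune the generator with retrieval-augmented supervision to keep outputs image-grounded and causally consistent.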
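
The scoring and top-K selection steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the review does not define V_I, V_D, or V_F, nor how p_G or sim are computed, so the sketch simply takes the graph probability and the similarity as precomputed numbers. The names `causal_score` and `top_k`, and the candidate values, are hypothetical.

```python
import math


def causal_score(log_p_graph: float, similarity: float, alpha: float) -> float:
    """Score(R_k) = (1 - alpha) * log p_G(V_D, V_F | V_I) + alpha * sim(V_I, R_k)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * log_p_graph + alpha * similarity


def top_k(candidates, alpha, k):
    """Rank candidates by causal_score and keep the k best.

    candidates: list of (report_id, graph_probability, similarity) tuples.
    graph_probability must be > 0, since its logarithm is taken.
    """
    scored = [
        (report_id, causal_score(math.log(p), sim, alpha))
        for report_id, p, sim in candidates
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical candidates: (id, p_G under the causal graph, semantic similarity).
candidates = [
    ("r1", 0.9, 0.2),  # causally plausible, low surface similarity
    ("r2", 0.1, 0.9),  # causally implausible, high surface similarity
    ("r3", 0.5, 0.5),
]

# alpha = 0 ranks purely by the causal term; alpha = 1 purely by similarity.
ranked_causal = [rid for rid, _ in top_k(candidates, alpha=0.0, k=2)]
ranked_similarity = [rid for rid, _ in top_k(candidates, alpha=1.0, k=2)]
```

With these made-up inputs, the causal-only ranking keeps r1 and r3, while the similarity-only ranking keeps r2 and r3, which shows how α trades the two terms against each other.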

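Experimental results

Research questions

RQ1: Can a structural causal model be built from multimodal medical data to guide cross-modal retrieval?
RQ2: Does causality-based retrieval improve factuality and robustness in radiology VQA and report generation compared with conventional RAG methods?
RQ3: How do manual refinement and the causal ratio setting affect performance and hallucination suppression?
RQ4: What is the trade-off between coverage and precision in evidence retrieval?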

Key findings

Model	Acc (IU-Xray)	F1 (IU-Xray)	AUC (IU-Xray)	Acc (MIMIC-CXR)	F1 (MIMIC-CXR)	AUC (MIMIC-CXR)	BLEU (IU-Xray)	R-L (IU-Xray)	MET (IU-Xray)	BLEU (MIMIC-CXR)	R-L (MIMIC-CXR)	MET (MIMIC-CXR)
LLaVA-Med-1.5	75.47	64.04	67.46	75.79	80.49	68.84	9.64	12.26	8.21	12.11	13.05	11.16
MMed-RAG	89.54	80.72	87.13	83.57	88.49	85.08	31.38	25.59	32.43	23.25	12.34	20.47
MCRAG	90.12	82.03	88.25	84.91	89.37	86.42	35.02	28.47	35.18	25.81	15.05	22.34

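MCRAG achieves state-of-the-art results on radiology VQA and report generation compared with strong baselines.
Ablations show that removing the causal component substantially degrades accuracy (e.g., Acc drops by about 3.65 points on MIMIC-CXR VQA) and also reduces fluency (lower BLEU).
Causality-oriented retrieval with manual refinement (τ = 0.7) performs stably across tasks.
The best retrieval configuration balances K against the effects of filtering and reranking; K = 10 performs best in the ablation.
On VQA, MCRAG reaches 90.12 Acc, 82.03 F1, and 88.25 AUC on IU-Xray, and 84.91 Acc, 89.37 F1, and 86.42 AUC on MIMIC-CXR. On report generation, it reaches 35.02 BLEU, 28.47 R-L, and 35.18 MET on IU-Xray, and 25.81 BLEU, 15.05 R-L, and 22.34 MET on MIMIC-CXR.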
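
To make the comparison with the strongest baseline concrete, the short script below computes MCRAG's gain over MMed-RAG for each column of the table. The numbers are copied from the table rows; the dictionary layout and variable names are only illustrative.

```python
# Column order follows the results table above.
columns = [
    "Acc (IU-Xray)", "F1 (IU-Xray)", "AUC (IU-Xray)",
    "Acc (MIMIC-CXR)", "F1 (MIMIC-CXR)", "AUC (MIMIC-CXR)",
    "BLEU (IU-Xray)", "R-L (IU-Xray)", "MET (IU-Xray)",
    "BLEU (MIMIC-CXR)", "R-L (MIMIC-CXR)", "MET (MIMIC-CXR)",
]
mmed_rag = [89.54, 80.72, 87.13, 83.57, 88.49, 85.08,
            31.38, 25.59, 32.43, 23.25, 12.34, 20.47]
mcrag = [90.12, 82.03, 88.25, 84.91, 89.37, 86.42,
         35.02, 28.47, 35.18, 25.81, 15.05, 22.34]

# Absolute gain of MCRAG over MMed-RAG, in points, rounded to two decimals.
gains = {
    name: round(ours - base, 2)
    for name, base, ours in zip(columns, mmed_rag, mcrag)
}

smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"smallest gain: {smallest}, largest gain: {largest}")
```

On these reported numbers, every gain is positive. The VQA gains are modest (from 0.58 points on IU-Xray accuracy to 1.34 on MIMIC-CXR accuracy and AUC), while the report-generation gains are larger (up to 3.64 BLEU points on IU-Xray).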

This review was generated by AI and checked by a human editor.