QUICK REVIEW

[Paper Review] Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

Shuai Lu, Meng Wang|arXiv (Cornell University)|Mar 7, 2026

Multimodal Machine Learning Applications | Citations: 0

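One-Sentence Summary

EyExIn strengthens retinal LVLMs with an expert-aware dual-stream architecture and adaptive deep injection, embedding domain knowledge and reaching state-of-the-art results on ophthalmic VQA in low-data settings.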

ABSTRACT

Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.

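Motivation and Objectives

Address the lack of domain-specific knowledge in ophthalmic LVLMs, which creates gaps in both perception and reasoning.
Develop a data-efficient framework that embeds retinal medical knowledge into LVLMs.
Decouple visual perception into an anatomical stream and a pathological stream, then fuse the two adaptively.
Ground reasoning in visual evidence by injecting expert features into intermediate LLM layers.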

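Proposed Method

Introduce Expert-Aware Dual-Stream Encoding to separate general anatomical context from pathological semantics.
Propose Semantic-Adaptive Gated Fusion, which dynamically weights general and expert visual features for each token.
Develop Adaptive Deep Expert Injection, which injects the fused visual features into intermediate LLM layers as persistent residual biases.
Use a Vision Anchor mechanism to tie clinical reasoning to visual evidence during decoding.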
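The fusion and injection steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the review does not give the gate formula, the pooling step, the injection scale, or which layers receive the anchor, so the scalar sigmoid gate, `mean_pool`, `scale`, and `inject_at` are all hypothetical choices made for illustration. The sketch also assumes the visual and hidden dimensions match, so no projection layer is needed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(general, expert, weights, bias):
    """Fuse two token sequences with a per-token scalar gate (assumed form).

    general, expert: lists of tokens, each token a list of floats.
    weights, bias: hypothetical gate parameters over the concatenated token.
    """
    fused = []
    for g_tok, e_tok in zip(general, expert):
        concat = g_tok + e_tok
        gate = sigmoid(sum(w * x for w, x in zip(weights, concat)) + bias)
        # gate near 1 favors the expert stream, near 0 the general stream
        fused.append([gate * e + (1.0 - gate) * g for g, e in zip(g_tok, e_tok)])
    return fused


def mean_pool(tokens):
    """Average the tokens into one anchor vector (an assumed pooling choice)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def inject_anchor(hidden, anchor, scale):
    """Add the scaled anchor to every position as a residual bias."""
    return [[h + scale * a for h, a in zip(tok, anchor)] for tok in hidden]


def run_layers(hidden, layers, anchor, inject_at, scale):
    """Apply each layer, re-injecting the anchor after the selected layers."""
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if idx in inject_at:
            hidden = inject_anchor(hidden, anchor, scale)
    return hidden


if __name__ == "__main__":
    general = [[1.0, 0.0], [0.0, 1.0]]   # toy general-stream tokens
    expert = [[0.0, 2.0], [2.0, 0.0]]    # toy expert-stream tokens
    fused = gated_fusion(general, expert, weights=[0.0] * 4, bias=0.0)
    anchor = mean_pool(fused)
    identity = lambda h: h               # stand-in for a transformer layer
    out = run_layers([[0.0, 0.0]], [identity] * 3, anchor, inject_at={1, 2}, scale=0.1)
    print(fused, anchor, out)
```

Because the anchor is added again after each selected layer, the visual signal is refreshed at depth rather than supplied only at the input, which is the intuition behind the "persistent residual bias" described above.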

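Experimental Results

Research Questions

RQ1: Does a dual-stream visual encoder (general + expert) improve detection of fine-grained lesions in retinal images?
RQ2: Does semantic-adaptive fusion better preserve pathological signals while suppressing background noise?
RQ3: Can persistent deep injection of expert features ground LLM reasoning in visual evidence across layers?
RQ4: How does EyExIn compare with proprietary and open-source baselines on closed and open-ended retinal VQA?

Key Findings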

Dataset	Method	Closed VQA (%): F1	Closed VQA (%): Recall	Closed VQA (%): Prec	Open-ended VQA (%): F1	Open-ended VQA (%): Recall	Open-ended VQA (%): Prec
TM4K	EyExIn (FT)	78.07	82.42	77.33	72.91	78.99	71.87
JSIEC	EyExIn (FT)	80.66	82.33	85.20	63.10	76.32	60.84
Retina	EyExIn (FT)	71.27	67.90	89.68	67.80	62.30	96.15
ODIR	EyExIn (FT)	60.09	59.81	64.69	56.70	55.20	60.40

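EyExIn achieves state-of-the-art Closed VQA F1 scores of 78.07 on TM4K and 80.66 on JSIEC.
EyExIn shows high pathology recall on Closed VQA (TM4K 82.42, JSIEC 82.33), with Open-ended VQA precision of 71.87 on TM4K and 60.84 on JSIEC.
Adaptive gating and deep injection substantially outperform naive fusion; in the TM4K Closed VQA ablation, the full EyExIn model yields the best F1 (78.07).
In qualitative Open VQA cases, EyExIn shows accurate lesion grounding and fewer hallucinations than Gemini3-Pro.
On TM4K, EyExIn leads on all text-generation metrics: BLEU-1, ROUGE-L, METEOR, and BERT-F1.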
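A note on reading the table: the reported F1 is not always the harmonic mean of the reported recall and precision. On Retina Closed VQA, for instance, recall 67.90 and precision 89.68 would give about 77.3, while the listed F1 is 71.27. The review does not say how the metrics are averaged. One possible explanation, offered here only as an assumption, is per-class (macro) averaging, under which the three numbers are averaged independently. The sketch below uses hypothetical confusion counts, not data from the paper, to show that effect.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_scores(per_class_counts):
    """Average precision, recall, and F1 independently across classes."""
    scores = [prf(*counts) for counts in per_class_counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical (tp, fp, fn) counts for two classes; not taken from the paper.
counts = [(9, 1, 1), (1, 0, 9)]
macro_p, macro_r, macro_f1 = macro_scores(counts)

# Harmonic mean of the already-averaged precision and recall.
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

# The two values differ: macro F1 averages per-class F1 scores instead.
print(round(macro_f1, 4), round(harmonic, 4))
```

If the paper does use macro averaging, a class with poor recall pulls F1 down more than the averaged precision and recall alone would suggest.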


This review was generated by AI and checked by a human editor.