QUICK REVIEW

[Paper Review] Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

Yiwei Li, Zihao Wu|arXiv (Cornell University)|Mar 5, 2026

Multimodal Machine Learning Applications|Cited by 0

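One-Sentence Summary

The paper proposes gaze-token supervision, which uses time-ordered eye-tracking trajectories to guide medical vision-language models to imitate radiologists' step-by-step visual reasoning, improving in-domain accuracy and zero-shot robustness.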

ABSTRACT

Vision-language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.

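Motivation and Objectives

Use radiologists' sequential gaze as a supervision signal for visual reasoning in medical VLMs.
Develop a lightweight gaze-token mechanism that aligns model attention with gaze-derived patch indices.
Produce fixed-format radiology reports while improving diagnostic accuracy and interpretability.
Evaluate in-domain performance on MIMIC-EYE and zero-shot robustness on external radiology datasets.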

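Proposed Method

Use a pretrained VLM backbone (Qwen2.5-VL-7B-Instruct) and embed four dedicated gaze tokens in the output sequence.
Train a gaze projection head that maps gaze-token hidden states to patch indices, enforcing the temporal order of the gaze targets.
Attach a 14-label classification head for multi-label radiological findings, using a fixed Yes/No report format.
Stage 1 optimizes gaze-token-to-patch-index alignment with a cross-entropy loss over the discretized gaze patches; Stage 2 optimizes a multi-label BCE loss (optionally jointly with the language-modeling loss).
Fine-tune with LoRA adapters, keeping the backbone frozen while learning the lightweight gaze-supervision components.
Represent gaze supervision as patch indices obtained by discretizing time-aligned gaze heatmaps onto the image patch grid.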
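
The discretization and the two training losses listed above can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the function names, the equal-duration split of the scan into one segment per gaze token, the most-fixated-patch rule, and the carry-forward for empty segments are all assumptions made here. The summary only states that gaze is discretized onto the patch grid in temporal order, that Stage 1 uses cross-entropy over patch indices, and that Stage 2 uses a multi-label BCE loss.

```python
import math


def gaze_to_patch_targets(fixations, img_w, img_h, grid_w, grid_h, num_tokens=4):
    """Map time-ordered fixations to one patch index per gaze token.

    fixations: list of (x, y, t) tuples in pixels and seconds.
    Assumption (not from the paper): the scan is split into equal-duration
    segments, and each segment's target is its most-fixated patch.
    """
    if not fixations:
        raise ValueError("need at least one fixation")
    fixations = sorted(fixations, key=lambda f: f[2])
    t0, t1 = fixations[0][2], fixations[-1][2]
    span = max(t1 - t0, 1e-9)
    counts = [dict() for _ in range(num_tokens)]
    for x, y, t in fixations:
        seg = min(int((t - t0) / span * num_tokens), num_tokens - 1)
        col = min(int(x / img_w * grid_w), grid_w - 1)
        row = min(int(y / img_h * grid_h), grid_h - 1)
        idx = row * grid_w + col  # row-major patch index
        counts[seg][idx] = counts[seg].get(idx, 0) + 1
    targets = []
    for seg_counts in counts:
        if seg_counts:
            # Most-fixated patch; ties resolved toward the smaller index.
            targets.append(min(seg_counts, key=lambda k: (-seg_counts[k], k)))
        else:
            # Empty segment: repeat the previous target (assumption).
            targets.append(targets[-1])
    return targets


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one gaze token."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def stage1_loss(gaze_logits, targets):
    """Mean cross-entropy over the gaze tokens, in temporal order."""
    pairs = list(zip(gaze_logits, targets))
    return sum(cross_entropy(l, t) for l, t in pairs) / len(pairs)


def stage2_loss(label_logits, labels):
    """Mean binary cross-entropy with logits over the finding labels."""
    total = 0.0
    for z, y in zip(label_logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(labels)


if __name__ == "__main__":
    # Hypothetical fixations on a 224x224 image with a 4x4 patch grid.
    fix = [(10, 10, 0.0), (12, 14, 0.2), (200, 40, 1.0),
           (210, 50, 1.2), (100, 180, 2.1), (30, 200, 3.0)]
    targets = gaze_to_patch_targets(fix, 224, 224, 4, 4)
    print("patch targets:", targets)
    uniform = [[0.0] * 16 for _ in targets]
    print("stage 1 loss (uniform logits):", stage1_loss(uniform, targets))
    print("stage 2 loss (zero logits):", stage2_loss([0.0] * 14, [0] * 14))
```

Because each gaze token has its own target, swapping two targets changes the loss unless the model's predictions are swapped too; this is the sense in which such a loss enforces temporal order. In a real training setup the logits would come from the gaze projection head and the 14-label classification head rather than from constant lists.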

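Experimental Results

Research Questions

RQ1: Does temporally ordered eye-gaze supervision improve visually grounded reasoning in medical VLMs?
RQ2: Does introducing gaze-token supervision outperform instruction-tuned baselines on chest X-ray interpretation?
RQ3: How does gaze-guided training affect in-domain performance and cross-domain generalization to external datasets?

Key Findings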

Method	AUROC	Acc.	F1
Vanilla	49.74	42.15	43.09
SFT	87.60	86.03	84.18
SFT-Heatmap	87.51	86.51	84.23
MedCLIP	87.37	86.63	84.32
EGMA	89.49	88.11	86.20
Random-Gaze	86.45	85.59	81.06
Shuffled-Gaze	88.51	87.48	84.97
Original-Gaze	90.17	89.02	87.61
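
The margins implied by the table can be computed directly. The short script below is a convenience sketch: the numbers are copied from the table above, while the dictionary layout and the `margin` helper are introduced here purely for illustration.

```python
# Scores copied from the table above, in the order (AUROC, Acc., F1).
RESULTS = {
    "SFT": (87.60, 86.03, 84.18),
    "EGMA": (89.49, 88.11, 86.20),
    "Random-Gaze": (86.45, 85.59, 81.06),
    "Shuffled-Gaze": (88.51, 87.48, 84.97),
    "Original-Gaze": (90.17, 89.02, 87.61),
}


def margin(method, baseline):
    """Per-metric difference (method minus baseline), rounded to 2 decimals."""
    return tuple(
        round(a - b, 2) for a, b in zip(RESULTS[method], RESULTS[baseline])
    )


if __name__ == "__main__":
    for baseline in ("SFT", "EGMA", "Shuffled-Gaze", "Random-Gaze"):
        print(f"Original-Gaze vs {baseline}:", margin("Original-Gaze", baseline))
```

Read this way, Original-Gaze leads SFT by 2.57 AUROC points and the strongest baseline, EGMA, by 0.68, while the gap to Shuffled-Gaze (1.66 AUROC points) isolates the contribution of temporal order.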

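Gaze-guided training yields consistent in-domain gains over the baselines; Original-Gaze reaches the highest AUROC on MIMIC-EYE (90.17, versus 89.49 for EGMA, the strongest baseline).
Stage 1 gaze supervision combined with fixed-format output substantially improves performance over instruction tuning alone.
Preserving the temporal order of the gaze signal gives the largest gain: Original-Gaze (90.17 AUROC) outperforms both Shuffled-Gaze (88.51) and Random-Gaze (86.45).
Gaze supervision improves zero-shot accuracy and F1 on the CheXpert, RSNA, and SIIM-ACR benchmarks, indicating better out-of-domain robustness.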
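
The Shuffled-Gaze and Random-Gaze rows are control conditions for the temporal-order finding. The summary does not say how these variants were built, so the sketch below is only one plausible construction, with hypothetical names: the shuffled variant keeps the same patches but permutes their order, and the random variant draws patch indices uniformly from the grid.

```python
import random


def make_gaze_variants(targets, num_patches, seed=0):
    """Build original, shuffled, and random gaze-target sequences.

    Assumed construction (not taken from the paper):
      - "shuffled" keeps the same patches but destroys temporal order;
      - "random" replaces every target with a uniformly drawn patch index.
    """
    rng = random.Random(seed)
    original = list(targets)
    shuffled = list(targets)
    if len(set(original)) > 1:
        # Re-shuffle until the order actually differs from the original.
        while shuffled == original:
            rng.shuffle(shuffled)
    random_gaze = [rng.randrange(num_patches) for _ in original]
    return {"original": original, "shuffled": shuffled, "random": random_gaze}


if __name__ == "__main__":
    # Hypothetical four-token target sequence on a 16-patch grid.
    variants = make_gaze_variants([0, 3, 13, 12], num_patches=16)
    for name, seq in variants.items():
        print(name, seq)
```

Under this construction, the shuffled condition tests whether order matters when spatial content is preserved, and the random condition tests whether the gaze locations carry any signal at all.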

This review was generated by AI and reviewed by human editors.