QUICK REVIEW

[Paper Review] Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

Guankun Wang, Long Bai | arXiv (Cornell University) | Mar 22, 2024

Multimodal Machine Learning Applications | Cited by 5

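One-Sentence Summary

Surgical-LVLM personalizes a large vision-language model with Visual Perception LoRA and Token-Interaction modules to improve grounding and reasoning in surgical VQA, achieving state-of-the-art results on the EndoVis datasets.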

ABSTRACT

Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in recognizing long-range dependencies and aligning multimodal information. In this paper, we introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. Leveraging the pre-trained large vision-language model and specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels in understanding complex visual-language tasks within surgical contexts. In addressing the visual grounding task, we propose the Token-Interaction (TIT) module, which strengthens the interaction between the grounding module and the language responses of the Large Visual Language Model (LVLM) after projecting them into the latent space. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset, which sets new performance standards. Our work contributes to advancing the field of automated surgical mentorship by providing a context-aware solution.

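Motivation and Objectives

Motivation: surgical VQA and VQLA tasks require domain-specific grounding.
Propose Surgical-LVLM as a personalized LVLM for complex surgical scenarios.
Introduce Visual Perception LoRA (VP-LoRA) to capture long-range context.
Develop the Token-Interaction (TIT) module to align language output with visual grounding.
Validate the method on the EndoVis-17/18 VQLA datasets and the new EndoVis Conversations dataset.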

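Proposed Method

Fine-tune Qwen-VL with VP-LoRA modules, which insert Visual State Space (VSS) blocks into the LoRA layers to propagate global context.
Introduce projection-based multimodal alignment that fuses Qwen-VL's language output with CAT-ViL grounding through the TIT module.
Use two-stage training: (i) vision-language instruction tuning on surgical question-answer pairs, and (ii) multimodal alignment between the language and grounding modules.
Build an EndoVis-based instruction-tuning dataset, generated with GPT-4 in the Qwen-VL format.
Use CAT-ViL co-attention embeddings for grounding, and integrate a token-interaction pathway to emphasize important vision-language tokens.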
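
To make the LoRA part of VP-LoRA concrete, here is a minimal sketch of a plain LoRA linear layer in dependency-free Python. It shows only the standard low-rank update, output = W x + (alpha / r) · B A x, with W frozen and B initialized to zero. The VSS block that VP-LoRA places inside the LoRA layer is not reproduced here, because this summary does not describe its internals. The names `LoRALinear`, `matvec`, `rank`, and `alpha` are illustrative assumptions, not identifiers from the paper or from Qwen-VL.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / rank) * B @ A.

    Illustrative sketch of standard LoRA only; the VSS block used by
    VP-LoRA is intentionally omitted (its details are not in the summary).
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, shape (out, in)
        # A gets a small random init, shape (rank, in).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        # B starts at zero, shape (out, rank), so the update is zero at init.
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)                  # W x
        low_rank = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + self.scale * d for b, d in zip(base, low_rank)]


W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5, 0.5]]
layer = LoRALinear(W, rank=2, alpha=4.0)
x = [1.0, 2.0, 3.0, 4.0]
print(layer.forward(x))  # [7.0, -2.0, 5.0], identical to W x because B is zero
```

Because B starts at zero, the adapted layer reproduces the pre-trained layer exactly before any training step, and only A and B would be updated during fine-tuning.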

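Experimental Results

Research Questions

RQ1: Can a personalized LVLM be effectively adapted to perform grounded VQA in robotic surgery?
RQ2: Do VP-LoRA blocks improve long-range vision-language understanding in surgical settings?
RQ3: Does instruction tuning combined with multimodal alignment achieve state-of-the-art grounding and reasoning on EndoVis tasks?
RQ4: How does Surgical-LVLM perform on EndoVis-17/18 VQLA and the new EndoVis Conversations dataset?
RQ5: How do the VP-LoRA and multimodal alignment ablations affect overall performance?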

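Key Findings

Surgical-LVLM with VP-LoRA and instruction tuning achieves the highest GPT-4-style scores in the EndoVis-18-VQLA and EndoVis-17-VQLA comparisons on the EndoVis Conversations dataset (90.68 and 83.24, respectively).
Instruction tuning markedly improves logical reasoning and answering in the surgical domain.
VP-LoRA consistently improves both language response quality and grounding performance.
Multimodal alignment (MA) combined with VP-LoRA yields the best overall grounding results, with a synergistic gain when the two are used together.
On EndoVis-18-VQLA, Surgical-LVLM reaches Acc 0.6947, F-Score 0.3325, and mIoU 0.8416; on EndoVis-17-VQLA, it reaches Acc 0.4068, F-Score 0.3412, and mIoU 0.7825.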
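
For readers unfamiliar with the metrics above, the following sketch shows how answer accuracy and mean IoU (mIoU) are commonly computed for a VQLA-style task, where each sample has a text answer and one bounding box. It is a generic illustration rather than the paper's evaluation script: the `(x1, y1, x2, y2)` box format, the exact-match rule for answers, and the function names `box_iou` and `evaluate` are all assumptions. F-Score is left out because its averaging scheme is not stated in this summary.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format (assumed, x1 < x2, y1 < y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(pred_answers, true_answers, pred_boxes, true_boxes):
    """Return (accuracy, mIoU) over paired predictions and ground truth."""
    n = len(true_answers)
    acc = sum(p == t for p, t in zip(pred_answers, true_answers)) / n
    miou = sum(box_iou(p, t) for p, t in zip(pred_boxes, true_boxes)) / n
    return acc, miou


# Toy data, invented purely for illustration.
acc, miou = evaluate(
    pred_answers=["grasping", "kidney"],
    true_answers=["grasping", "cutting"],
    pred_boxes=[(0, 0, 2, 2), (0, 0, 2, 2)],
    true_boxes=[(0, 0, 2, 2), (1, 1, 3, 3)],
)
print(acc)   # 0.5
print(miou)  # (1.0 + 1/7) / 2
```

In this toy case one of two answers matches, and the two box pairs have IoU 1 and 1/7, so a model can score high on mIoU while still missing answers, which matches the pattern of the reported numbers.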

This review was generated by AI and checked by a human editor.