QUICK REVIEW

[Paper Review] Towards Understanding Multimodal Fine-Tuning: Spatial Features

Lachin Naghashyar, Hunar Batra|arXiv (Cornell University)|Feb 6, 2026
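Multimodal Machine Learning Applications|Cited by 0
One-Sentence Summary

This paper extends stage-wise model diffing to vision-language models, identifies spatial grounding features in the language backbone, and traces them to a small number of attention heads through causal attribution and ablation.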

ABSTRACT

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

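Research Motivation and Goals

  • Understand how a pretrained language backbone adapts to visual grounding during multimodal fine-tuning.
  • Identify features that rotate or reorient toward a vision preference when encoding spatial relations.
  • Isolate a compact set of features that supports spatial reasoning and trace its causal drivers.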

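Proposed Method

  • Adapt sparse autoencoders (SAEs) trained on the Llama backbone to multimodal activations from LLaVA-MORE.
  • Use stage-wise model diffing to detect rotations and obtain vision-preferring features.
  • Identify spatial features by comparing behavior under spatial versus neutral prompts and filtering out lexical artifacts.
  • Apply attribution patching to locate the mid-layer attention heads that drive the spatial features.
  • Run ablation experiments to test the causal involvement of the spatial features in spatial reasoning tasks.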
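
The attribution patching step in the list above can be illustrated with a toy example. Attribution patching estimates, to first order, how much a metric would change if one component's activation were swapped from a clean run to a corrupted run: the activation difference is multiplied by the metric's gradient with respect to that activation. The sketch below is not the authors' implementation. The head names, the two-dimensional head outputs, and the linear readout standing in for a spatial feature's activation are all hypothetical, chosen so the estimate can be checked against an exact patch.

```python
def attribution_patching_scores(clean_acts, corrupt_acts, grads):
    """First-order estimate of the metric change from patching each head.

    clean_acts / corrupt_acts: {head_name: [floats]} head outputs on the
    clean and corrupted runs.
    grads: {head_name: [floats]} gradient of the metric with respect to each
    head output, evaluated on the clean run.
    """
    scores = {}
    for head, clean in clean_acts.items():
        corrupt = corrupt_acts[head]
        grad = grads[head]
        scores[head] = sum((c - a) * g for a, c, g in zip(clean, corrupt, grad))
    return scores


def toy_feature_activation(head_outputs, readout):
    """Hypothetical metric: a linear readout of the head outputs."""
    return sum(
        r * x
        for head in sorted(head_outputs)
        for r, x in zip(readout[head], head_outputs[head])
    )


# Hypothetical heads and values, for illustration only.
readout = {"L14.H3": [1.0, -0.5], "L15.H7": [0.2, 0.1], "L2.H0": [0.0, 0.0]}
clean = {"L14.H3": [2.0, 1.0], "L15.H7": [1.0, 1.0], "L2.H0": [3.0, 3.0]}
corrupt = {"L14.H3": [0.0, 1.0], "L15.H7": [1.0, 0.0], "L2.H0": [0.0, 0.0]}

# For a linear metric the gradient with respect to each head is its readout row.
scores = attribution_patching_scores(clean, corrupt, grads=readout)
ranked = sorted(scores, key=lambda h: abs(scores[h]), reverse=True)

# Sanity check: actually patch the top head and measure the true change.
patched = dict(clean)
patched[ranked[0]] = corrupt[ranked[0]]
actual_change = (
    toy_feature_activation(patched, readout)
    - toy_feature_activation(clean, readout)
)
```

Because the toy metric is linear, the estimate for the top-ranked head matches the true change from patching it. In a real model the metric is nonlinear, so the scores are only an approximation and the gradients would come from a backward pass.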
Figure 1: SAE adaptation on LLaVA-MORE. Top: Mean fraction of variance unexplained (FVU) across layers on the validation set. Bottom: Summary statistics of FVU values on the validation set, with decimal alignment; the lowest mean is highlighted in bold.
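
For readers unfamiliar with the metric in Figure 1, fraction of variance unexplained is commonly computed as the squared reconstruction error divided by the total variance of the activations around their mean. The sketch below uses that common definition; the summary does not state the exact normalization the authors used, so treat the formula and the function name as assumptions.

```python
def fraction_of_variance_unexplained(activations, reconstructions):
    """FVU = sum ||x - x_hat||^2 / sum ||x - mean(x)||^2 (assumed definition).

    activations, reconstructions: lists of equal-length float vectors.
    0.0 means perfect reconstruction; 1.0 is no better than predicting the mean.
    """
    n = len(activations)
    dim = len(activations[0])
    mean = [sum(x[j] for x in activations) / n for j in range(dim)]
    residual = sum(
        (x[j] - r[j]) ** 2
        for x, r in zip(activations, reconstructions)
        for j in range(dim)
    )
    total = sum((x[j] - mean[j]) ** 2 for x in activations for j in range(dim))
    return residual / total


# Toy activations with mean [2.0, 1.0] (illustrative values only).
acts = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
fvu_perfect = fraction_of_variance_unexplained(acts, acts)
fvu_mean_only = fraction_of_variance_unexplained(acts, [[2.0, 1.0]] * 4)
fvu_partial = fraction_of_variance_unexplained(
    acts, [[1.0, 1.0], [3.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
)
```

A lower FVU after adapting an SAE to multimodal activations indicates that the dictionary reconstructs those activations more faithfully.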

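Experimental Results

Research Questions

  • RQ1: How does multimodal fine-tuning reshape the language backbone's representations in VLMs?
  • RQ2: Which features become vision-preferring, and how do they reorient to encode spatial relations?
  • RQ3: Which attention heads causally drive the spatial grounding features, and how are they organized?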

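Key Findings

  • After multimodal adaptation, a small fraction of features (about 5%) show a vision preference and strong geometric rotation.
  • The defined set of spatial features activates consistently on questions about placement, relative position, and orientation.
  • Attribution patching reveals a sparse set of mid-layer attention heads responsible for spatial grounding.
  • Ablating the top spatial features causes a marked drop in spatial reasoning performance (9–16 points on VSR) with limited effect on general VQA, indicating functional specificity.
  • The same attention heads recur across related spatial relations, suggesting a structured pathway for visual grounding.
  • The method provides a feature-level mechanistic view of multimodal adaptation, complementing higher-level alignment analyses.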
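
The ablation finding above depends on removing specific SAE features from the model's activations. One common way to do this is to subtract each selected feature's contribution, its activation times its decoder direction, from the activation vector, which leaves everything the SAE does not explain untouched. Whether the authors ablate in exactly this way is not stated in this summary; the function name, shapes, and values below are illustrative assumptions.

```python
def ablate_features(activation, feature_acts, decoder, ablate_ids):
    """Remove selected SAE features from one activation vector.

    activation: [d_model] floats, the original activation.
    feature_acts: [n_features] floats, SAE feature activations for this token.
    decoder: [n_features][d_model] floats, one decoder direction per feature.
    ablate_ids: indices of the features to remove (hypothetical selection).
    """
    edited = list(activation)
    for i in ablate_ids:
        for j, d in enumerate(decoder[i]):
            edited[j] -= feature_acts[i] * d
    return edited


# Toy setup: three features whose decoder directions are the coordinate axes.
activation = [1.0, 2.0, 3.0]
decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feature_acts = [0.5, 2.0, 0.0]

# Ablating feature 1 removes its 2.0 units along the second axis.
ablated = ablate_features(activation, feature_acts, decoder, ablate_ids=[1])
```

In an actual experiment the edited vector would be written back into the forward pass, for example through a hook at the SAE's layer, and task accuracy would be compared with and without the edit.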
Figure 2: Distribution of SAE features by visual energy and cosine similarity. All features are shown in gray; adapted features are highlighted in pink. Spatial candidates are marked with blue squares, and the subset used for downstream analysis is shown as red crosses.
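
The two axes in Figure 2 suggest how a stage-wise diffing filter might work: compare each feature's direction before and after multimodal adaptation, and measure how much of its activation falls on image tokens. The sketch below is one plausible reading, not the paper's definition. It assumes "visual energy" is the share of a feature's squared activation on image tokens, that rotation is measured by cosine similarity between matching decoder directions, and that the thresholds are arbitrary illustrative values.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def visual_energy(feature_acts, is_image_token):
    """Assumed definition: share of squared activation on image tokens."""
    total = sum(a * a for a in feature_acts)
    visual = sum(a * a for a, img in zip(feature_acts, is_image_token) if img)
    return visual / total if total else 0.0


def adapted_features(base_decoder, tuned_decoder, acts_per_feature,
                     is_image_token, max_cosine=0.5, min_energy=0.7):
    """Select features that rotated (low cosine) and prefer image tokens.

    max_cosine and min_energy are hypothetical thresholds.
    """
    selected = []
    for i, (base, tuned) in enumerate(zip(base_decoder, tuned_decoder)):
        cos = cosine_similarity(base, tuned)
        energy = visual_energy(acts_per_feature[i], is_image_token)
        if cos <= max_cosine and energy >= min_energy:
            selected.append((i, cos, energy))
    return selected


# Toy data: three features, four tokens (the first two are image tokens).
base_decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tuned_decoder = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
acts_per_feature = [
    [1.0, 1.0, 1.0, 1.0],  # unchanged direction, mixed modality
    [2.0, 2.0, 0.0, 1.0],  # rotated direction, mostly image tokens
    [3.0, 0.0, 0.0, 0.0],  # image-only but direction unchanged
]
is_image_token = [True, True, False, False]

selected = adapted_features(
    base_decoder, tuned_decoder, acts_per_feature, is_image_token
)
selected_ids = [i for i, _, _ in selected]
```

Only the second feature passes both filters in this toy data: its direction rotated by 90 degrees and roughly 89% of its squared activation sits on image tokens.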


This review was generated by AI and reviewed by a human editor.