QUICK REVIEW

[Paper Review] Towards Understanding Multimodal Fine-Tuning: Spatial Features

Lachin Naghashyar, Hunar Batra | arXiv (Cornell University) | Feb 6, 2026
Multimodal Machine Learning Applications | Citations: 0
One-Sentence Summary

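The paper extends stage-wise model diffing to vision-language models, identifies spatially grounded features in the language backbone, and traces them to a small set of attention heads through causal attribution and ablation.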

ABSTRACT

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

Motivation and Objectives

  • Understand how pretrained language backbones adapt under visual grounding during multimodal fine-tuning.
  • Identify vision-preferring features that rotate or reorient to encode spatial relations.
  • Isolate a compact set of features that underpin spatial reasoning and trace their causal drivers.

Proposed Method

  • Adapt sparse autoencoders (SAEs) trained on Llama backbones to multimodal activations from LLaVA-MORE.
  • Use stage-wise model diffing to detect features that rotate and gain visual preference.
  • Identify spatial features by comparing performance under spatial prompts vs. neutral prompts and filtering lexical artifacts.
  • Apply attribution patching to locate the mid-layer attention heads driving spatial features.
  • Perform ablations to test the causal involvement of spatial features in spatial reasoning tasks.
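As a rough illustration of the attribution patching step listed above, the sketch below scores each attention head by the first-order estimate (clean activation minus corrupted activation) times the gradient of a target metric at the corrupted run. Everything here is a toy assumption for illustration: `toy_feature_activation`, its weights, and the four "heads" are hypothetical stand-ins, and the gradient is taken by finite differences rather than by backpropagation through a real VLM.

```python
from typing import Callable, List


def numerical_grad(metric: Callable[[List[float]], float],
                   acts: List[float], eps: float = 1e-5) -> List[float]:
    """Central finite-difference gradient of `metric` at `acts`.

    A real pipeline would obtain this gradient with one backward pass.
    """
    grads = []
    for i in range(len(acts)):
        up = list(acts)
        up[i] += eps
        down = list(acts)
        down[i] -= eps
        grads.append((metric(up) - metric(down)) / (2 * eps))
    return grads


def attribution_patching(metric: Callable[[List[float]], float],
                         clean: List[float],
                         corrupt: List[float]) -> List[float]:
    """First-order estimate of how much patching each unit from its
    corrupted value to its clean value would change `metric`."""
    grads = numerical_grad(metric, corrupt)
    return [(c - k) * g for c, k, g in zip(clean, corrupt, grads)]


def toy_feature_activation(heads: List[float]) -> float:
    """Hypothetical spatial-feature activation as a linear readout of
    four head outputs (weights are made up for this example)."""
    weights = [0.1, 2.0, 0.0, -0.5]
    return sum(w * h for w, h in zip(weights, heads))


clean_run = [1.0, 1.0, 1.0, 1.0]    # e.g. prompt with a spatial question
corrupt_run = [0.0, 0.0, 0.0, 0.0]  # e.g. neutral prompt
scores = attribution_patching(toy_feature_activation, clean_run, corrupt_run)
top_head = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(top_head)
```

Because the toy metric is linear, the estimate matches true activation patching exactly; in a real network it is only an approximation, which is why high-scoring heads are usually confirmed with direct ablation afterwards.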
Figure 1: SAE adaptation on LLaVA-MORE. Top: Mean fraction of variance unexplained (FVU) across layers on the validation set. Bottom: Summary statistics of FVU values on the validation set, with decimal alignment; the lowest mean is highlighted in bold.
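For readers unfamiliar with the FVU metric in Figure 1, the following minimal sketch uses the standard definition: the squared reconstruction error divided by the total variance of the activations around their mean. How the paper aggregates this per layer or per token is not stated in this review, so the function name and the toy data are assumptions for illustration only.

```python
from typing import List


def fraction_of_variance_unexplained(xs: List[List[float]],
                                     recons: List[List[float]]) -> float:
    """FVU = sum ||x - x_hat||^2 / sum ||x - mean(x)||^2 over all samples.

    0.0 means perfect reconstruction; 1.0 is what you get by always
    predicting the mean activation.
    """
    n, d = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(d)]
    residual = sum((x[j] - r[j]) ** 2
                   for x, r in zip(xs, recons) for j in range(d))
    total = sum((x[j] - mean[j]) ** 2 for x in xs for j in range(d))
    return residual / total


# Toy activations (3 tokens, 2 dimensions) and an imperfect reconstruction.
activations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reconstruction = [[1.5, 2.0], [3.0, 4.0], [5.0, 5.5]]
fvu = fraction_of_variance_unexplained(activations, reconstruction)
print(fvu)  # 0.03125
```

A lower FVU after adapting the SAE to multimodal activations would indicate that the dictionary still reconstructs the backbone's representations well.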

Experimental Results

Research Questions

  • RQ1: How does multimodal fine-tuning reshape language-backbone representations in VLMs?
  • RQ2: Which features become vision-preferring and how do they reorient to encode spatial relations?
  • RQ3: Which attention heads causally drive spatially grounded features and how are they organized?

Key Findings

Layer | Feature | Delta VSR Acc | Delta VQA Acc | Delta Ctrl | VSR Relation         | VSR OR
7     | 15870   | -15.54        | -0.10         | -0.88      | above                | 4.32
11    | 27061   | -12.77        | -0.40         |  0.00      | across from          | 8.03
9     | 15404   | -11.19        | -0.80         |  1.08      | below                | 5.60
14    | 17873   | -10.21        | -0.30         | -1.71      | at the right side of | 7.17
12    | 23874   |  -9.05        | -0.40         | -0.95      | left of              | 9.10
18    | 29948   |  -7.98        | -0.30         |  0.00      | beside               | 8.36
  • A small subset of features (~5%) shows visual preference and strong geometric rotation after multimodal adaptation.
  • A defined spatial feature set consistently activates for questions about placement, relative position, and orientation.
  • Attribution patching reveals a sparse group of mid-layer heads responsible for spatial grounding.
  • Ablation of top spatial features causes substantial drops in spatial reasoning performance (9–16 points on VSR) with limited impact on general VQA, indicating functional specificity.
  • The same heads recur across related spatial relations, suggesting structured pathways for visual grounding.
  • The approach provides a feature-level mechanistic view of multimodal adaptation, complementing higher-level alignment analyses.
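The ablation finding above can be made concrete with a small sketch. One common way to ablate an SAE feature, assumed here rather than confirmed as the authors' exact procedure, is to subtract that feature's contribution (its activation times its decoder direction) from the hidden state, then compare task accuracy before and after. The names `ablate_feature` and `accuracy_delta` and all numbers are hypothetical.

```python
from typing import List


def ablate_feature(hidden: List[float], feature_acts: List[float],
                   decoder: List[List[float]], idx: int) -> List[float]:
    """Remove feature `idx` from a hidden state: x' = x - a_idx * d_idx.

    `decoder[idx]` is the feature's decoder direction and
    `feature_acts[idx]` its activation on this token.
    """
    a = feature_acts[idx]
    return [x - a * d for x, d in zip(hidden, decoder[idx])]


def accuracy_delta(baseline_correct: List[bool],
                   ablated_correct: List[bool]) -> float:
    """Change in accuracy (percentage points) caused by the ablation."""
    def accuracy(flags: List[bool]) -> float:
        return 100.0 * sum(flags) / len(flags)
    return accuracy(ablated_correct) - accuracy(baseline_correct)


# Toy hidden state with two dictionary features along the first two axes.
hidden_state = [1.0, 2.0, 3.0]
decoder_dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
activations = [0.5, 2.0]
ablated_state = ablate_feature(hidden_state, activations, decoder_dirs, idx=1)
print(ablated_state)  # [1.0, 0.0, 3.0]

# Toy per-example correctness on a spatial benchmark, before and after.
before = [True] * 8 + [False] * 2
after = [True] * 6 + [False] * 4
print(accuracy_delta(before, after))  # -20.0
```

Running the same comparison on a spatial benchmark and on a general VQA or control set is what distinguishes a functionally specific feature (large drop on one, near-zero change on the others) from one whose removal degrades the model broadly.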
Figure 2: Distribution of SAE features by visual energy and cosine similarity. All features are shown in gray; adapted features are highlighted in pink. Spatial candidates are marked with blue squares, and the subset used for downstream analysis is shown as red crosses.
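The two axes of Figure 2 can be sketched as follows. This review does not give the paper's exact definitions, so the code makes two labeled assumptions: cosine similarity is taken between a feature's direction before and after multimodal fine-tuning (low values meaning the feature rotated), and "visual energy" is taken as the share of a feature's squared activation that falls on image tokens. The thresholds in `is_adapted` are hypothetical.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two feature directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def visual_energy(acts: List[float], is_image_token: List[bool]) -> float:
    """Assumed definition: fraction of squared activation on image tokens."""
    total = sum(a * a for a in acts)
    if total == 0.0:
        return 0.0
    visual = sum(a * a for a, m in zip(acts, is_image_token) if m)
    return visual / total


def is_adapted(base_dir: List[float], tuned_dir: List[float],
               acts: List[float], is_image_token: List[bool],
               cos_max: float = 0.5, energy_min: float = 0.5) -> bool:
    """Flag a feature as adapted if it rotated AND prefers image tokens.

    Both thresholds are illustrative, not values from the paper.
    """
    rotated = cosine_similarity(base_dir, tuned_dir) < cos_max
    visual = visual_energy(acts, is_image_token) > energy_min
    return rotated and visual


# A feature whose direction rotated by 90 degrees and fires only on image tokens.
flag = is_adapted(base_dir=[1.0, 0.0], tuned_dir=[0.0, 1.0],
                  acts=[3.0, 4.0, 0.0],
                  is_image_token=[True, True, False])
print(flag)  # True
```

Under these assumptions, the pink "adapted" points in the figure would correspond to features in the low-cosine, high-visual-energy region of the plot.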


This review was generated by AI and checked by a human editor.