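[Paper Review] Towards Understanding Multimodal Fine-Tuning: Spatial Features
The paper extends stage-wise model diffing to vision-language models, identifies spatially grounded features in the language backbone, and traces them to a small set of attention heads through causal attribution and ablation.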
Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.
Research Motivation and Objectives
- Understand how pretrained language backbones adapt under visual grounding during multimodal fine-tuning.
- Identify vision-preferring features that rotate or reorient to encode spatial relations.
- Isolate a compact set of features that underpin spatial reasoning and trace their causal drivers.
Proposed Method
- Adapt sparse autoencoders (SAEs) trained on Llama backbones to multimodal activations from LLaVA-MORE.
- Use stage-wise model diffing to detect features that rotate and gain visual preference.
- Identify spatial features by comparing performance under spatial prompts vs. neutral prompts and filtering lexical artifacts.
- Apply attribution patching to locate the mid-layer attention heads driving spatial features.
- Perform ablations to test the causal involvement of spatial features in spatial reasoning tasks.
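
The diffing step in the list above can be illustrated with a small sketch. It assumes that "rotation" is measured as the cosine similarity between a feature's SAE decoder direction before and after multimodal fine-tuning, and that "visual preference" is the share of a feature's mean activation that falls on image tokens. Both definitions, the function names, the thresholds, and the toy numbers are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diff_features(dec_before, dec_after, act_image, act_text,
                  rotation_threshold=0.5, preference_threshold=0.7):
    """Flag features that rotated during fine-tuning and now prefer image tokens.

    dec_before / dec_after: one decoder direction per feature, per stage.
    act_image / act_text: mean activation per feature on image vs. text tokens.
    The thresholds are hypothetical, not values reported in the paper.
    """
    flagged = []
    for i, (d0, d1) in enumerate(zip(dec_before, dec_after)):
        similarity = cosine(d0, d1)  # low similarity means the direction rotated
        total = act_image[i] + act_text[i]
        visual_pref = act_image[i] / total if total > 0 else 0.0
        if similarity < rotation_threshold and visual_pref > preference_threshold:
            flagged.append((i, similarity, visual_pref))
    return flagged


# Toy data: three features with two-dimensional decoder directions.
dec_before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_after = [[1.0, 0.1], [1.0, 0.2], [-1.0, 1.0]]
act_image = [0.2, 0.9, 0.1]
act_text = [0.8, 0.1, 0.9]

flagged = diff_features(dec_before, dec_after, act_image, act_text)
for index, similarity, visual_pref in flagged:
    print(f"feature {index}: cosine={similarity:.3f}, visual preference={visual_pref:.2f}")
```

In this toy data only feature 1 is flagged: feature 0 barely rotates, and feature 2 rotates but still activates mostly on text tokens.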

Experimental Results
Research Questions
- RQ1: How does multimodal fine-tuning reshape language-backbone representations in VLMs?
- RQ2: Which features become vision-preferring and how do they reorient to encode spatial relations?
- RQ3: Which attention heads causally drive spatially grounded features and how are they organized?
Key Findings
| Layer | Feature | Delta VSR Acc | Delta VQA Acc | Delta Ctrl | VSR Relation | VSR OR |
|---|---|---|---|---|---|---|
| 7 | 15870 | -15.54 | -0.10 | -0.88 | above | 4.32 |
| 11 | 27061 | -12.77 | -0.40 | 0.00 | across from | 8.03 |
| 9 | 15404 | -11.19 | -0.80 | 1.08 | below | 5.60 |
| 14 | 17873 | -10.21 | -0.30 | -1.71 | at the right side of | 7.17 |
| 12 | 23874 | -9.05 | -0.40 | -0.95 | left of | 9.10 |
| 18 | 29948 | -7.98 | -0.30 | 0.00 | beside | 8.36 |
- A small subset of features (~5%) shows visual preference and strong geometric rotation after multimodal adaptation.
- A defined spatial feature set consistently activates for questions about placement, relative position, and orientation.
- Attribution patching reveals a sparse group of mid-layer heads responsible for spatial grounding.
- Ablating the top spatial features causes substantial drops in spatial reasoning accuracy (roughly 8 to 16 points on VSR for the features in the table above) while general VQA accuracy falls by less than 1 point, indicating functional specificity.
- The same heads recur across related spatial relations, suggesting structured pathways for visual grounding.
- The approach provides a feature-level mechanistic view of multimodal adaptation, complementing higher-level alignment analyses.
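
The attribution-patching finding above relies on a first-order estimate: the effect of patching a head's clean activation into a corrupted run is approximated by the activation difference multiplied by the gradient of the metric. The sketch below shows that estimate on a toy linear metric, where the approximation happens to be exact. The head names, activations, and readout weights are hypothetical and do not come from the paper.

```python
def attribution_patching(clean_acts, corrupt_acts, grads_at_corrupt):
    """Estimate each head's effect as (clean - corrupt) . d(metric)/d(activation).

    All three arguments map a head name to a list of floats of equal length.
    """
    scores = {}
    for head in clean_acts:
        delta = [c - k for c, k in zip(clean_acts[head], corrupt_acts[head])]
        scores[head] = sum(d * g for d, g in zip(delta, grads_at_corrupt[head]))
    return scores


# Hypothetical linear readout: the metric is a weighted sum of head outputs,
# so its gradient with respect to each head output is the weight vector itself.
readout = {"L12.H3": [2.0, -1.0], "L12.H7": [0.0, 0.5], "L14.H1": [0.1, 0.1]}


def metric(acts):
    """Toy spatial-feature metric computed from per-head activations."""
    return sum(w * a for head in acts for w, a in zip(readout[head], acts[head]))


# "Clean" stands for a spatial prompt and "corrupt" for a neutral prompt.
clean = {"L12.H3": [1.0, 0.0], "L12.H7": [1.0, 1.0], "L14.H1": [0.5, 0.5]}
corrupt = {"L12.H3": [0.0, 1.0], "L12.H7": [1.0, 0.0], "L14.H1": [0.5, 0.4]}

scores = attribution_patching(clean, corrupt, readout)
ranked = sorted(scores, key=scores.get, reverse=True)
for head in ranked:
    print(f"{head}: {scores[head]:.3f}")
```

For a real model the gradients would come from a backward pass through the network, and the estimate is only approximate, so the highest-scoring heads are usually confirmed with direct activation patching or ablation.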

This review was generated by AI and checked by a human editor.