QUICK REVIEW

[Paper Review] Towards Understanding Multimodal Fine-Tuning: Spatial Features

Lachin Naghashyar, Hunar Batra|arXiv (Cornell University)|Feb 6, 2026

Multimodal Machine Learning Applications|Citations: 0

One-Sentence Summary

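The paper extends stage-wise model diffing to vision-language models, identifies spatially grounded features in the language backbone, and traces them to a small set of attention heads through causal attribution and ablation.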

ABSTRACT

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

Motivation and Objectives

Understand how pretrained language backbones adapt under visual grounding during multimodal fine-tuning.
Identify vision-preferring features that rotate or reorient to encode spatial relations.
Isolate a compact set of features that underpin spatial reasoning and trace their causal drivers.

Proposed Method

Adapt sparse autoencoders (SAEs) trained on Llama backbones to multimodal activations from LLaVA-MORE.
Use stage-wise model diffing to detect features that rotate and gain visual preference.
Identify spatial features by comparing their activations under spatial prompts vs. neutral prompts and filtering out lexical artifacts.
Apply attribution patching to locate the mid-layer attention heads driving spatial features.
Perform ablations to test the causal involvement of spatial features in spatial reasoning tasks.
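
The diffing step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the base and adapted SAEs have index-aligned decoder matrices, measures "rotation" as the cosine similarity between matching decoder directions, and defines "visual energy" as the share of a feature's squared activation that falls on image tokens. The function names, the energy definition, and the thresholds `cos_max` and `energy_min` are all hypothetical.

```python
import numpy as np

def decoder_cosine(base_dec, adapted_dec):
    """Row-wise cosine similarity between index-aligned decoder directions.

    Both inputs have shape (n_features, d_model).
    """
    num = np.sum(base_dec * adapted_dec, axis=1)
    den = np.linalg.norm(base_dec, axis=1) * np.linalg.norm(adapted_dec, axis=1)
    return num / np.maximum(den, 1e-12)

def visual_energy(acts, is_image_token):
    """Share of each feature's squared activation that lands on image tokens.

    acts: (n_tokens, n_features) SAE activations.
    is_image_token: (n_tokens,) boolean mask (assumed to be available).
    """
    sq = acts ** 2
    total = sq.sum(axis=0)
    visual = sq[is_image_token].sum(axis=0)
    return visual / np.maximum(total, 1e-12)

def adapted_features(cos, energy, cos_max=0.5, energy_min=0.7):
    """Indices of features that rotated (low cosine) and prefer vision (high energy).

    The thresholds are illustrative placeholders, not values from the paper.
    """
    return np.flatnonzero((cos < cos_max) & (energy > energy_min))

# Toy data: feature 0 is unchanged, features 1 and 2 swap directions.
base_dec = np.eye(3)
adapted_dec = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
acts = np.array([[0.0, 2.0, 0.0], [0.0, 2.0, 1.0], [1.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
is_image_token = np.array([True, True, False, False])

cos = decoder_cosine(base_dec, adapted_dec)
energy = visual_energy(acts, is_image_token)
selected = adapted_features(cos, energy)
```

In this toy setup only feature 1 both rotates and fires almost exclusively on image tokens, so it is the only one selected.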

Figure 1: SAE adaptation on LLaVA-MORE. Top: Mean fraction of variance unexplained (FVU) across layers on the validation set. Bottom: Summary statistics of FVU values on the validation set, with decimal alignment; the lowest mean is highlighted in bold.
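
For readers unfamiliar with the metric in Figure 1, FVU is commonly computed as the residual sum of squares of the SAE reconstruction divided by the total sum of squares of the original activations. The sketch below uses that common definition; whether the paper normalizes per token, per layer, or in some other way is not stated in this review, so treat the details as an assumption.

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained for a reconstruction x_hat of x.

    x, x_hat: (n_samples, d_model). 0 means a perfect reconstruction;
    1 means no better than always predicting the per-dimension mean.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return residual / total

# Toy check: a perfect reconstruction and a mean-only reconstruction.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
perfect = fvu(x, x)
mean_only = fvu(x, np.tile(x.mean(axis=0), (2, 1)))
```

Lower is better, which matches the caption's note that the lowest mean FVU is the one highlighted.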

Experimental Results

Research Questions

RQ1: How does multimodal fine-tuning reshape language-backbone representations in VLMs?
RQ2: Which features become vision-preferring and how do they reorient to encode spatial relations?
RQ3: Which attention heads causally drive spatially grounded features and how are they organized?

Key Findings

Layer	Feature	Delta VSR Acc	Delta VQA Acc	Delta Ctrl	VSR Relation	VSR OR
7	15870	-15.54	-0.10	-0.88	above	4.32
11	27061	-12.77	-0.40	0.00	across from	8.03
9	15404	-11.19	-0.80	1.08	below	5.60
14	17873	-10.21	-0.30	-1.71	at the right side of	7.17
12	23874	-9.05	-0.40	-0.95	left of	9.10
18	29948	-7.98	-0.30	0.00	beside	8.36
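
The table reports accuracy changes when a single SAE feature is ablated. The review does not say how the ablation is implemented, so the following is only one plausible sketch: it assumes a ReLU SAE and removes a feature by subtracting its activation times its decoder direction from the residual-stream activations, which leaves the SAE's reconstruction error untouched. The function name and all parameter names are hypothetical.

```python
import numpy as np

def ablate_feature(resid, enc_w, enc_b, dec_w, feature_idx):
    """Subtract one SAE feature's decoded contribution from the activations.

    resid: (n_tokens, d_model) residual-stream activations.
    enc_w: (d_model, n_features), enc_b: (n_features,) SAE encoder (ReLU assumed).
    dec_w: (n_features, d_model) SAE decoder directions.
    """
    acts = np.maximum(resid @ enc_w + enc_b, 0.0)
    contribution = np.outer(acts[:, feature_idx], dec_w[feature_idx])
    return resid - contribution

# Toy SAE with identity encoder and decoder, so feature i reads dimension i.
resid = np.array([[3.0, 4.0], [-1.0, 2.0]])
enc_w = np.eye(2)
enc_b = np.zeros(2)
dec_w = np.eye(2)

ablated = ablate_feature(resid, enc_w, enc_b, dec_w, feature_idx=0)
```

In the toy case, the first token loses its component along feature 0, while the second token is unchanged because the feature is inactive there (its pre-activation is negative).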

A small subset of features (~5%) shows visual preference and strong geometric rotation after multimodal adaptation.
A defined spatial feature set consistently activates for questions about placement, relative position, and orientation.
Attribution patching reveals a sparse group of mid-layer heads responsible for spatial grounding.
Ablation of top spatial features causes substantial drops in spatial reasoning performance (9–16 points on VSR) with limited impact on general VQA, indicating functional specificity.
The same heads recur across related spatial relations, suggesting structured pathways for visual grounding.
The approach provides a feature-level mechanistic view of multimodal adaptation, complementing higher-level alignment analyses.
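
Attribution patching, as referenced in the findings above, is usually described as a first-order approximation to activation patching: the effect of swapping a component's activation from a corrupted run to a clean run is estimated as the activation difference dotted with the gradient of the metric. The sketch below follows that general recipe with a hand-supplied gradient; in practice the gradient would come from a backward pass through the model. The metric (a spatial feature's activation), the array shapes, and the choice to take the gradient at the corrupted run are assumptions, not details confirmed by this review.

```python
import numpy as np

def attribution_patching(clean_heads, corrupt_heads, grad_at_corrupt):
    """First-order estimate of each head's effect on a scalar metric.

    clean_heads, corrupt_heads: (n_heads, d_model) per-head outputs on the two runs.
    grad_at_corrupt: (n_heads, d_model) gradient of the metric w.r.t. each
        head's output, evaluated on the corrupted run.
    Returns an (n_heads,) array of attribution scores.
    """
    return np.sum((clean_heads - corrupt_heads) * grad_at_corrupt, axis=1)

# Toy metric: feature activation = w . (sum of head outputs), so the gradient
# with respect to every head's output is simply w.
w = np.array([1.0, 0.5])
clean_heads = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
corrupt_heads = np.zeros_like(clean_heads)
grad = np.tile(w, (3, 1))

scores = attribution_patching(clean_heads, corrupt_heads, grad)
top_heads = np.argsort(-np.abs(scores))  # heads ranked by absolute attribution
```

Because the toy metric is linear, the scores sum exactly to the metric difference between the clean and corrupted runs; for a real model the estimate is only approximate.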

Figure 2: Distribution of SAE features by visual energy and cosine similarity. All features are shown in gray; adapted features are highlighted in pink. Spatial candidates are marked with blue squares, and the subset used for downstream analysis is shown as red crosses.

This review was generated by AI and checked by a human editor.