QUICK REVIEW

[Paper Review] Towards Understanding Multimodal Fine-Tuning: Spatial Features

Lachin Naghashyar, Hunar Batra | arXiv (Cornell University) | 2026-02-06

Multimodal Machine Learning Applications | Citations: 0

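One-Line Summary

The paper extends stage-wise model diffing to vision-language models, identifies spatially grounded features in the language backbone, and traces them to a small number of attention heads through causal attribution and ablation experiments.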

ABSTRACT

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

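Motivation and Objectives

Understand how a pretrained language backbone adapts under visual grounding during multimodal fine-tuning.
Identify which vision-preferring features rotate or reorient to encode spatial relations.
Isolate a compact set of features that supports spatial reasoning and trace their causal origins.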

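Proposed Method

Adapt sparse autoencoders (SAEs) trained on the Llama backbone to the multimodal activations of LLaVA-MORE.
Use stage-wise model diffing to detect features that rotate and acquire a visual preference.
Identify spatial features by comparing feature responses on spatial and neutral prompts and filtering out lexical artifacts.
Apply attribution patching to locate the mid-layer attention heads that drive the spatial features.
Run ablations to confirm that the spatial features are causally involved in spatial reasoning tasks.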
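
The "rotation" detected by stage-wise diffing can be pictured as a drop in cosine similarity between a feature's direction before and after SAE adaptation. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the 0.5 threshold, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotated_features(base_dirs, adapted_dirs, threshold=0.5):
    """Return indices of features whose cosine similarity fell below `threshold`.

    base_dirs / adapted_dirs hold one direction per SAE feature, taken before
    and after adaptation. The threshold is an assumed value, not from the paper.
    """
    return [
        i
        for i, (b, a) in enumerate(zip(base_dirs, adapted_dirs))
        if cosine(b, a) < threshold
    ]

# Toy directions (made-up numbers): 3 features in a 2-dimensional space.
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adapted = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.9]]

# Only feature 1 changes direction substantially in this toy data.
print(rotated_features(base, adapted))
```

In practice the directions would be rows of the SAE decoder matrices, and a rotation criterion would typically be combined with a separate measure of how strongly each feature responds to visual tokens.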

Figure 1: SAE adaptation on LLaVA-MORE. Top: Mean fraction of variance unexplained (FVU) across layers on the validation set. Bottom: Summary statistics of FVU values on the validation set, with decimal alignment; the lowest mean is highlighted in bold.
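
For readers unfamiliar with the metric in Figure 1, FVU is commonly defined as the squared reconstruction error divided by the total variance of the inputs around their mean, so 0 means perfect reconstruction and 1 means the reconstruction does no better than predicting the mean. The sketch below uses that standard definition; whether the paper normalizes per token or per batch is not stated in this review, and the toy activations are invented.

```python
def fvu(inputs, reconstructions):
    """Fraction of variance unexplained for a batch of activation vectors.

    inputs, reconstructions: lists of equal-length lists of floats.
    """
    n = len(inputs)
    dim = len(inputs[0])
    # Per-dimension mean of the original activations.
    mean = [sum(x[d] for x in inputs) / n for d in range(dim)]
    # Squared error between activations and their SAE reconstructions.
    residual = sum(
        (x[d] - r[d]) ** 2
        for x, r in zip(inputs, reconstructions)
        for d in range(dim)
    )
    # Total squared deviation of the activations from their mean.
    total = sum((x[d] - mean[d]) ** 2 for x in inputs for d in range(dim))
    return residual / total

# Toy activations (made-up): three 2-dimensional vectors.
acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
recon = [[1.0, 2.0], [3.0, 4.0], [5.0, 4.0]]  # one coordinate is off by 2

print(fvu(acts, acts))   # perfect reconstruction
print(fvu(acts, recon))  # imperfect reconstruction
```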

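Experimental Results

Research Questions

RQ1: How does multimodal fine-tuning reshape the language backbone representations of a VLM?
RQ2: Which features acquire a visual preference, and how do they reorient to encode spatial relations?
RQ3: Which attention heads causally drive spatially grounded features, and how are they organized?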

Key Results

Layer	Feature	Delta VSR Acc	Delta VQA Acc	Delta Ctrl	VSR Relation	VSR OR
7	15870	-15.54	-0.10	-0.88	above	4.32
11	27061	-12.77	-0.40	0.00	across from	8.03
9	15404	-11.19	-0.80	1.08	below	5.60
14	17873	-10.21	-0.30	-1.71	at the right side of	7.17
12	23874	-9.05	-0.40	-0.95	left of	9.10
18	29948	-7.98	-0.30	0.00	beside	8.36
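
One way to read the table is to compare each feature's VSR change against its general VQA change. The sketch below assumes the Delta columns are accuracy changes in points after ablating the listed feature, which matches the ablation results described in this review but is not spelled out in the table itself. The `specificity_ratio` helper is a hypothetical convenience, not a metric from the paper.

```python
# Rows copied from the table above:
# (layer, feature, delta_vsr, delta_vqa, delta_ctrl, relation, vsr_or)
rows = [
    (7, 15870, -15.54, -0.10, -0.88, "above", 4.32),
    (11, 27061, -12.77, -0.40, 0.00, "across from", 8.03),
    (9, 15404, -11.19, -0.80, 1.08, "below", 5.60),
    (14, 17873, -10.21, -0.30, -1.71, "at the right side of", 7.17),
    (12, 23874, -9.05, -0.40, -0.95, "left of", 9.10),
    (18, 29948, -7.98, -0.30, 0.00, "beside", 8.36),
]

def specificity_ratio(delta_vsr, delta_vqa):
    """Size of the VSR change relative to the VQA change (hypothetical metric)."""
    return abs(delta_vsr) / abs(delta_vqa)

ratios = {
    (layer, feature): specificity_ratio(d_vsr, d_vqa)
    for layer, feature, d_vsr, d_vqa, _ctrl, _rel, _or in rows
}

for (layer, feature), ratio in ratios.items():
    print(f"layer {layer:>2} feature {feature}: VSR/VQA change ratio = {ratio:.1f}")

# Smallest ratio across the six features.
print(f"minimum ratio: {min(ratios.values()):.2f}")
```

Under this reading, every listed feature changes VSR accuracy by more than ten times as much as it changes general VQA accuracy.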

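After multimodal adaptation, a small subset of features (~5%) shows both a visual preference and a strong geometric rotation.
The identified set of spatial features activates consistently on questions about placement, relative position, and orientation.
Attribution patching reveals a sparse group of mid-layer heads involved in spatial grounding.
Ablating the top spatial features causes a substantial drop in spatial reasoning performance (9–16 points on VSR) with only a limited effect on general VQA, suggesting functional specificity.
The same heads recur across related spatial relations, suggesting a structured pathway for visual grounding.
The approach offers a feature-level mechanistic view of multimodal adaptation that complements high-level alignment analyses.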
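
Attribution patching, as generally described, approximates the effect of swapping a component's activation from a corrupted run to a clean run with a first-order term: (clean activation minus corrupted activation) times the gradient of the target metric at the corrupted activation. The sketch below illustrates only that approximation on a toy problem. It treats each head's output as a single scalar, uses finite differences in place of backpropagation, and the `toy_feature` weights are invented; none of this reflects the paper's actual model or code.

```python
def numerical_grad(metric, acts, eps=1e-6):
    """Central finite-difference gradient of `metric` with respect to each activation."""
    grads = []
    for i in range(len(acts)):
        up = list(acts)
        down = list(acts)
        up[i] += eps
        down[i] -= eps
        grads.append((metric(up) - metric(down)) / (2 * eps))
    return grads

def attribution_patching(metric, clean_acts, corrupt_acts):
    """First-order estimate of each head's effect when patched from corrupt to clean."""
    grads = numerical_grad(metric, corrupt_acts)
    return [(c - k) * g for c, k, g in zip(clean_acts, corrupt_acts, grads)]

def toy_feature(head_outputs):
    """Hypothetical spatial-feature activation, driven mostly by head 1."""
    weights = [0.1, 2.0, 0.0, -0.3]
    return sum(w * h for w, h in zip(weights, head_outputs))

clean = [1.0, 1.0, 1.0, 1.0]    # head outputs on a spatial prompt (made-up)
corrupt = [0.0, 0.0, 0.0, 0.0]  # head outputs on a neutral prompt (made-up)

scores = attribution_patching(toy_feature, clean, corrupt)
top_head = max(range(len(scores)), key=lambda i: abs(scores[i]))
print("attribution scores:", [round(s, 3) for s in scores])
print("most influential head:", top_head)
```

Because the toy metric is linear, the first-order estimate matches the true patching effect here; in a real network it is only an approximation, which is why the highest-scoring heads are usually confirmed with direct ablation or patching.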

Figure 2: Distribution of SAE features by visual energy and cosine similarity. All features are shown in gray; adapted features are highlighted in pink. Spatial candidates are marked with blue squares, and the subset used for downstream analysis is shown as red crosses.

This review was generated by AI and reviewed by a human editor.