QUICK REVIEW

[Paper Review] Same Answer, Different Representations: Hidden Instability in VLMs

Farooq Ahmad Wani, Alessandro Suglia|arXiv (Cornell University)|Feb 6, 2026

Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

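The paper introduces a representation-aware, frequency-aware robustness framework for vision-language models (VLMs) and shows that hidden internal drift can occur under perturbation even when the output stays the same.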

ABSTRACT

The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.

Motivation and Objectives

Motivate robustness evaluation beyond output invariance to detect hidden multimodal instability in VLMs.
Propose a representation-aware framework that measures embedding drift, spectral changes, and structural smoothness in VLMs.
Identify failure modes and quantify how perturbations affect reasoning and hallucination tasks.
Assess robustness across model scales, datasets, and architectures to understand scaling effects.

Proposed Method

Develop an evaluation framework that couples label stability with internal representation metrics and margin dynamics.
Measure Embedding Stability, Dirichlet Energy (structural smoothness), Perturbation Drift vs. Control Drift, and Drift-to-Prior across multiple prompt regimes.
Use a log-likelihood MCQ scoring protocol to track margin dynamics and decision boundaries.
Evaluate six perturbation families (translation, padding/cropping, scaling, scaling with padding, rotation, and text overlays), including semantic overlays and occlusions.
Analyze across SEEDBench, MMMU, and POPE to study cross-dataset and cross-architecture robustness.
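Two of the quantities listed above can be illustrated with a minimal sketch. It assumes that Dirichlet energy is taken as the mean squared difference between 4-neighbour vision tokens on a row-major grid, and that the MCQ margin is the gap between the best and second-best option log-likelihoods. The paper may use a different normalization or margin definition, and the function names here are hypothetical.

```python
def dirichlet_energy(tokens, height, width):
    """Mean squared difference between 4-neighbour tokens on a height x width grid.

    tokens: list of height * width embedding vectors in row-major order.
    The averaging over edges is an assumption, not the paper's stated formula.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total, edges = 0.0, 0
    for r in range(height):
        for c in range(width):
            i = r * width + c
            if c + 1 < width:   # edge to the right neighbour
                total += sq_dist(tokens[i], tokens[i + 1])
                edges += 1
            if r + 1 < height:  # edge to the neighbour below
                total += sq_dist(tokens[i], tokens[i + width])
                edges += 1
    return total / edges if edges else 0.0


def predicted_option(option_logliks):
    """Option label with the highest log-likelihood."""
    return max(option_logliks, key=option_logliks.get)


def mcq_margin(option_logliks):
    """Gap between the best and second-best option log-likelihoods (assumed definition)."""
    ranked = sorted(option_logliks.values(), reverse=True)
    return ranked[0] - ranked[1]


if __name__ == "__main__":
    # Hypothetical scores for one question before and after a perturbation.
    base = {"A": -1.0, "B": -3.0, "C": -2.5}
    perturbed = {"A": -1.8, "B": -3.0, "C": -2.0}
    print(predicted_option(base), mcq_margin(base))
    print(predicted_option(perturbed), mcq_margin(perturbed))
```

In the hypothetical example the predicted option is the same for both score sets while the margin shrinks, which is the kind of change that label-only metrics would not register.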

Figure 1: Drift versus control drift, measured as cosine distance ($1-\cos$), for the ans_mcq_free embedding under the Translation and TextOverlay perturbations. Blue shows perturbation-induced drift relative to the base image; orange shows control drift (base image versus randomly sampled other images). Left: Tran
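The comparison in Figure 1 can be sketched as follows, assuming embeddings are plain non-zero vectors and that control drift pairs each base image with one other image chosen uniformly at random. The function names and the sampling scheme are illustrative assumptions, not the authors' code.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cos(a, b); assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drift_vs_control(base, perturbed, seed=0):
    """Return (mean perturbation drift, mean control drift).

    base[i] and perturbed[i] are the embeddings of image i and of its
    perturbed version. Needs at least two images for the control pairing.
    """
    rng = random.Random(seed)
    n = len(base)
    drift = [cosine_distance(base[i], perturbed[i]) for i in range(n)]
    control = []
    for i in range(n):
        # Pair image i with a different, randomly sampled image.
        j = rng.choice([k for k in range(n) if k != i])
        control.append(cosine_distance(base[i], base[j]))
    return sum(drift) / n, sum(control) / n
```

If the first value approaches the second, the perturbed embedding is about as far from its base image as an unrelated image would be, which is the situation the abstract describes for text overlays.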

Experimental Results

Research Questions

RQ1: Does output-level robustness in VLMs mask internal representation drift under meaning-preserving perturbations?
RQ2: How do perturbations affect internal embeddings, spectral content, and local token smoothness in VLMs?
RQ3: Does model scale improve robustness, or can larger models be more fragile under certain perturbations?
RQ4: How do perturbations influence reasoning versus hallucination tasks in VLMs?
RQ5: What is the role of frequency content and cross-frequency coherence in VLM robustness?

Key Findings

Perturbation	IFR	IV
Translation	0.062	0.168
Pad/Crop	0.065	0.169
Scale	0.079	0.079
Scale+Pad	0.080	0.100
Rotation	0.122	0.166
TextOverlay(semantic)	0.192	0.239
TextOverlay(random)	0.064	0.086
TextOverlay(empty)	0.043	0.044
Any (union)	0.079	0.376

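Taken over all perturbations (the "Any (union)" row), 37.6% of images have at least one perturbation that flips the model's decision.
Semantic text overlays are the most disruptive single perturbation, with IFR 19.2% and IV 23.9%; random and empty overlays are much milder (IFR 6.4% and 4.3%).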
Representation drift can be large even when predictions stay the same, and drift magnitudes often rival inter-image variability.
Model scale does not guarantee robustness; larger models show equal or greater representation drift and error transitions under perturbations.
Perturbations harm reasoning tasks but can reduce false positives on hallucination benchmarks by promoting more conservative predictions.
Across datasets and architectures, robustness failures persist and do not scale monotonically with capacity.
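The review does not define the IFR and IV columns of the table. The sketch below assumes that IFR is the fraction of perturbed instances whose prediction differs from the base prediction, and that IV is the fraction of images with at least one such flip. That reading is consistent with the 37.6% figure for the union row, but it remains an assumption, and `flip_rates` is a hypothetical helper.

```python
from collections import defaultdict


def flip_rates(records):
    """Compute (IFR, IV) under the assumed definitions.

    records: non-empty iterable of (image_id, base_pred, perturbed_pred),
    one entry per perturbed variant of an image.
    """
    total = 0
    flipped = 0
    image_flipped = defaultdict(bool)  # image_id -> any flip seen so far
    for image_id, base_pred, perturbed_pred in records:
        flip = base_pred != perturbed_pred
        total += 1
        flipped += flip
        image_flipped[image_id] |= flip
    ifr = flipped / total                                   # per-instance rate
    iv = sum(image_flipped.values()) / len(image_flipped)   # per-image rate
    return ifr, iv
```

Under these definitions IV can never be lower than IFR when every image has the same number of variants, and the two coincide when each image has a single variant, which would match the Scale row (0.079 for both).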

Figure 2: Qwen3-VL (Instruct) scaling on SEEDBench. Left: base accuracy versus ground truth. Right: average flip rate under natural perturbations (lower is better).


This review was generated by AI and reviewed by a human editor.