QUICK REVIEW

[Paper Review] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Qingtao Pan, Zhihao Dou | arXiv (Cornell University) | Mar 11, 2026

Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

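FMVR is a plug-and-play visual restoration strategy that disentangles compressed visual tokens into low- and high-frequency components to recover visual semantics, allowing Matryoshka-based LMM training to support elastic token budgets with minimal FLOPs overhead while preserving accuracy.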

ABSTRACT

Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. It enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, thus enabling elastic adjustment of the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be open.

Research Motivation and Objectives

Motivate the problem of semantic loss when reducing visual tokens in large multimodal models.
Propose FMVR to restore diluted visual semantics via frequency-based decomposition and modulation.
Integrate FMVR with Matryoshka Representation Learning to support elastic inference with varying token budgets.
Demonstrate that FMVR-LLaVA achieves substantial FLOPs reduction while maintaining or improving accuracy across image and video benchmarks.

Proposed Method

Disentangle compressed visual representations into low- and high-frequency components using AvgPool and MaxPool during Matryoshka token construction.
Apply lightweight learnable modulation parameters to refine high- and low-frequency components.
Fuse frequency-restored tokens to form reinforced nested visual token sets for LMM training (Matryoshka Representation Learning).
Train FMVR in a two-stage regime on LLaVA-based architectures to enable elastic inference with different token budgets.
Evaluate across 10 image-based and 4 video-based benchmarks to validate efficiency and accuracy gains.
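The pooling, modulation, and fusion steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are single scalars on a 24 × 24 grid (576 = 24 × 24), the names `pool2d`, `restore`, and `nested_token_sets` are hypothetical, and the weighted sum with scalars `alpha` and `beta` is an assumed stand-in for the learnable modulation, whose exact form the source does not give. The window sizes are also an assumption, chosen because the token budgets in the results table (576, 144, 36, 9, 1) are the squares of 24, 12, 6, 3, and 1. The two branches are named after the pooling operator rather than a frequency band, since the source does not spell out the mapping in formula form.

```python
# Illustrative sketch only: scalar "tokens" on a square grid, pure Python.
# alpha/beta and the weighted-sum fusion are assumptions, not the paper's formula.

def mean(values):
    return sum(values) / len(values)

def pool2d(grid, k, reduce_fn):
    """Non-overlapping k x k pooling over a square grid (list of lists)."""
    n = len(grid)
    assert n % k == 0, "grid side must be divisible by the window size"
    out = []
    for r in range(0, n, k):
        row = []
        for c in range(0, n, k):
            window = [grid[r + i][c + j] for i in range(k) for j in range(k)]
            row.append(reduce_fn(window))
        out.append(row)
    return out

def restore(grid, k, alpha=1.0, beta=0.0):
    """Pool with AvgPool and MaxPool, then fuse the two branches.

    alpha and beta stand in for the lightweight learnable modulation
    parameters; with alpha=1 and beta=0 this reduces to plain average pooling.
    """
    avg_branch = pool2d(grid, k, mean)
    max_branch = pool2d(grid, k, max)
    return [
        [alpha * a + beta * m for a, m in zip(avg_row, max_row)]
        for avg_row, max_row in zip(avg_branch, max_branch)
    ]

def nested_token_sets(grid, windows=(1, 2, 4, 8, 24), alpha=1.0, beta=0.0):
    """Build coarse-to-fine token sets, keyed by token count."""
    sets = {}
    for k in windows:
        pooled = restore(grid, k, alpha, beta)
        sets[len(pooled) ** 2] = pooled
    return sets

# Toy 24 x 24 grid standing in for 576 visual tokens.
grid = [[float((r * 24 + c) % 7) for c in range(24)] for r in range(24)]
sets = nested_token_sets(grid, alpha=1.0, beta=0.5)
print(sorted(sets))  # [1, 9, 36, 144, 576]
```

In a real model each token would be a feature vector and `alpha` and `beta` would be trained, but the nesting is the same: every coarser set is derived from the same 576-token grid, which is what lets the token budget be chosen at inference time.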

Figure 1: Our FMVR (b) can restore the visual semantics from compressed tokens, alleviating the loss of visual contents in previous token compression methods (a).

Experimental Results

Research Questions

RQ1: How does token reduction degrade visual semantics and model reasoning in LMMs?
RQ2: Can a frequency-based restoration (FMVR) recover diluted visual semantics from compressed tokens?
RQ3: Does integrating FMVR with Matryoshka Representation Learning enable elastic token budgets without sacrificing accuracy?
RQ4: What are the efficiency gains (FLOPs, latency) when using FMVR at various token budgets across image and video tasks?

Key Findings

Methods	#Vision Tokens	VQAv2	GQA	VizWiz	SQA-IMG	TextVQA	POPE	MME	MMBench-EN	MMBench-CN	MM-Vet	Avg.
LLaVA-v1.5 baseline	576	78.5	62.0	50.0	66.8	58.2	85.9	1510.7	64.3	58.3	30.5	63.0
Ours 1 token	1	68.3	55.2	49.7	68.6	49.2	81.1	1284.8	60.7	53.4	26.4	57.7
Ours 9 tokens	9	74.5	59.1	50.7	69.9	50.8	84.1	1415.0	64.2	57.5	29.0	61.1
Ours 36 tokens	36	76.5	60.9	52.9	69.5	55.3	85.9	1452.5	65.2	58.3	32.2	62.9
Ours 144 tokens	144	78.6	62.3	55.1	69.7	55.5	86.4	1473.9	65.8	57.6	33.4	63.8
Ours 576 tokens	576	79.2	63.0	56.5	68.9	57.8	87.5	1510.1	65.9	58.0	34.3	64.7
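The Avg. column mixes MME, which is on a scale of roughly 0 to 2000, with nine scores that sit between 0 and 100. The sketch below checks one plausible reading: MME is divided by 20 before averaging over all ten benchmarks. That divisor is an assumption inferred from the numbers; the source does not state it.

```python
# Scores copied from the table; each entry is
# (nine non-MME scores, MME score, reported Avg.).
ROWS = {
    "LLaVA-v1.5 baseline (576)": ([78.5, 62.0, 50.0, 66.8, 58.2, 85.9, 64.3, 58.3, 30.5], 1510.7, 63.0),
    "Ours 1 token": ([68.3, 55.2, 49.7, 68.6, 49.2, 81.1, 60.7, 53.4, 26.4], 1284.8, 57.7),
    "Ours 9 tokens": ([74.5, 59.1, 50.7, 69.9, 50.8, 84.1, 64.2, 57.5, 29.0], 1415.0, 61.1),
    "Ours 36 tokens": ([76.5, 60.9, 52.9, 69.5, 55.3, 85.9, 65.2, 58.3, 32.2], 1452.5, 62.9),
    "Ours 144 tokens": ([78.6, 62.3, 55.1, 69.7, 55.5, 86.4, 65.8, 57.6, 33.4], 1473.9, 63.8),
    "Ours 576 tokens": ([79.2, 63.0, 56.5, 68.9, 57.8, 87.5, 65.9, 58.0, 34.3], 1510.1, 64.7),
}

MME_DIVISOR = 20.0  # assumption: inferred from the table, not stated in the source

def average(scores, mme):
    """Mean over ten benchmarks with MME rescaled to a 0-100 range."""
    return (sum(scores) + mme / MME_DIVISOR) / (len(scores) + 1)

for name, (scores, mme, reported) in ROWS.items():
    print(f"{name}: computed {average(scores, mme):.2f}, reported {reported}")
```

Under this assumption every computed average lands within rounding distance of the reported value, which suggests the Avg. column is not dominated by the raw MME score.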

FMVR-LLaVA maintains competitive accuracy on the image benchmarks with far fewer visual tokens: 36 tokens average 62.9 and 144 tokens average 63.8, against 63.0 for the 576-token LLaVA-v1.5 baseline.
FMVR enables large FLOPs reductions (e.g., ×8.9) with only marginal drops in average accuracy on image benchmarks when tokens are reduced.
Across the 10 image benchmarks, FMVR-LLaVA with the full 576 tokens scores 79.2 on VQAv2 and 64.7 on average, exceeding the 576-token LLaVA-v1.5 baseline (78.5 and 63.0).
With 180–720 visual tokens, FMVR-LLaVA significantly outperforms other vision-token pruning methods on both image and video tasks.
In video benchmarks, FMVR-LLaVA (720 tokens) reaches 65.9 average, and with 180 tokens surpasses several existing methods in both accuracy and efficiency.
Efficiency analysis shows FMVR adds negligible FLOPs (~6.4e-5) per token restoration step while enabling substantial token reduction and fast prefill times.
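The findings quote a FLOPs reduction as a factor (×8.9), while the abstract quotes a percentage (89%). The short sketch below converts between the two forms and shows the fraction of visual tokens kept at each budget. Treating ×8.9 as "FLOPs divided by 8.9" is an assumption about how the summary uses the figure, and the function names are illustrative.

```python
def reduction_from_factor(factor):
    """Fraction of FLOPs removed when compute shrinks by `factor`."""
    return 1.0 - 1.0 / factor

def factor_from_reduction(reduction):
    """Shrink factor implied by removing `reduction` (0-1) of the FLOPs."""
    return 1.0 / (1.0 - reduction)

print(f"x8.9 -> {reduction_from_factor(8.9):.1%} fewer FLOPs")  # 88.8%
print(f"89%  -> x{factor_from_reduction(0.89):.1f}")            # x9.1

# Share of the 576 visual tokens kept at each budget in the table.
FULL_TOKENS = 576
for budget in (144, 36, 9, 1):
    print(f"{budget:>3} tokens: {budget / FULL_TOKENS:.2%} of the visual tokens")
```

Under that reading the two figures agree after rounding. Total FLOPs do not fall in exact proportion to the visual token count, because text tokens and fixed per-model costs remain.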

Figure 2: Grad-CAM visualization (576 and 36 visual tokens) shows that the reduction of visual tokens leads to a noticeable degradation in visual focus.

This review was generated by AI and reviewed by human editors.