QUICK REVIEW

[Paper Review] LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Yuzhang Shang, Mu Cai|arXiv (Cornell University)|Mar 22, 2024

Topic Modeling | Cited by 5

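One-Sentence Summary

The paper proposes PruMerge, an adaptive visual token reduction method that prunes visual tokens while preserving performance, reaching similar results with only about 6.9% of the tokens on average.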

ABSTRACT

Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. However, due to the inherent design of the Transformer architecture, the computational costs of these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism that identifies significant spatial redundancy among visual tokens. In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. Specifically, to metric the importance of each token, we exploit the sparsity observed in the visual encoder, characterized by the sparse distribution of attention scores between the class token and visual tokens. This sparsity enables us to dynamically select the most crucial visual tokens to retain. Subsequently, we cluster the selected (unpruned) tokens based on their key similarity and merge them with the unpruned tokens, effectively supplementing and enhancing their informational content. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 14 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.

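Motivation and Goals

Address the high computational cost that long visual token sequences impose on LMMs.
Develop an adaptive token reduction mechanism that selects informative tokens according to image content.
Preserve performance on visual question-answering and reasoning tasks while cutting the token count substantially.
Introduce a token merging strategy that feeds information from the pruned tokens back into the retained ones.
Demonstrate plug-and-play applicability to LLaVA-1.5 without extensive retraining.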

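Proposed Method

Identify important visual tokens with an outlier-based (IQR) selection over the attention distribution from the class token to the spatial tokens.
Adaptively prune the n visual tokens down to a target of m << n, based on each token's attention with the [CLS] token.
Cluster around the selected tokens using k-nearest neighbors and update each cluster center by weighted averaging (Token Supplement).
Use the dot-product similarity of the keys (K) to merge pruned tokens into the most similar cluster, enriching the retained tokens.
Optionally fine-tune the large language model (with LoRA) so that it adapts better to the reduced-token regime.
Provide a PruMerge+ variant that adds spatially uniform sampling of extra tokens to stabilize performance.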
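
The selection and merging steps listed above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names, the 1.5 x IQR outlier rule, the choice of k, and the use of [CLS] attention scores as merge weights are all assumptions made for illustration, and the toy tokens and keys are invented.

```python
import math

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def select_by_iqr(cls_attn, ratio=1.5):
    """Indices whose [CLS] attention is an upper outlier (> Q3 + ratio * IQR).
    The 1.5 ratio is the textbook IQR rule, assumed here."""
    s = sorted(cls_attn)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    threshold = q3 + ratio * (q3 - q1)
    return [i for i, a in enumerate(cls_attn) if a > threshold]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def prune_and_merge(tokens, keys, cls_attn, k=2):
    """Keep the outlier tokens, then fold the k most key-similar pruned
    tokens into each kept token with an attention-weighted average."""
    kept = select_by_iqr(cls_attn)
    kept_set = set(kept)
    pruned = [i for i in range(len(tokens)) if i not in kept_set]
    merged = []
    for i in kept:
        # Rank pruned tokens by dot-product similarity of their keys to key i.
        neighbours = sorted(pruned, key=lambda j: dot(keys[i], keys[j]),
                            reverse=True)[:k]
        group = [i] + neighbours
        total = sum(cls_attn[j] for j in group)
        dim = len(tokens[i])
        # Weighted average of the group (weights are an assumption).
        merged.append([sum(cls_attn[j] * tokens[j][d] for j in group) / total
                       for d in range(dim)])
    return kept, merged

# Toy input: 8 two-dimensional tokens; keys reuse the token vectors.
tokens = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.0, 1.0],
          [0.1, 0.9], [0.2, 1.0], [0.0, 1.2], [0.5, 0.5]]
keys = tokens
cls_attn = [0.02, 0.03, 0.40, 0.02, 0.03, 0.02, 0.45, 0.03]

kept, merged = prune_and_merge(tokens, keys, cls_attn)
print(kept)    # indices of the retained tokens
print(merged)  # one merged vector per retained token
```

Because the threshold comes from the attention distribution of each image, the number of retained tokens varies with the input instead of being fixed in advance, which is what makes the reduction adaptive.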

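Experimental Results

Research Questions

RQ1: Does reducing visual tokens preserve performance across diverse VQA and reasoning benchmarks?
RQ2: How much token reduction can be achieved without significant degradation?
RQ3: Does adaptive token selection outperform fixed or uniform sampling strategies across tasks?
RQ4: Can merging the pruned tokens mitigate the information loss caused by aggressive pruning?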

Key Findings

Method	LLM	Res.	PT	IT	VQAv2	SQA^I	VQA^T	POPE	MME	MMB
BLIP-2	Vicuna-13B	224	129M	-	41.0	61	42.5	85.3	1293.8	-
InstructBLIP	Vicuna-7B	224	129M	1.2M	-	60.5	50.1	-	-	36
InstructBLIP	Vicuna-13B	224	129M	1.2M	-	63.1	50.7	78.9	1212.8	-
Shikra	Vicuna-13B	224	600K	5.5M	77.4	-	-	-	-	58.8
IDEFICS-9B	LLaMA-7B	224	353M	1M	50.9	-	25.9	-	-	48.2
IDEFICS-80B	LLaMA-65B	224	353M	1M	60.0	-	30.9	-	-	54.5
Qwen-VL	Qwen-7B	448	1.4B	50M	78.8	67.1	63.8	-	-	38.2
Qwen-VL-Chat	Qwen-7B	448	1.4B	50M	78.2	68.2	61.5	-	1487.5	60.6
LLaVA-1.5	Vicuna-7B	336	558K	665K	78.5	66.8	58.2	85.9	1510.7	64.3
LLaVA-1.5 + PruMerge	Vicuna-7B	336	558K	665K	72.0	68.5	56.0	76.3	1350.3	60.9
LLaVA-1.5	Vicuna-13B	336	558K	665K	80.0	71.6	61.3	85.9	1531.3	67.7
LLaVA-1.5 + PruMerge	Vicuna-13B	336	558K	665K	72.8	71.0	58.4	78.5	1428.2	62.3
LLaVA-1.5 + PruMerge+	Vicuna-13B	336	558K	665K	77.8	71.0	58.6	84.4	1485.5	65.7
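
To make the per-benchmark gaps in the table easier to read, the short sketch below computes how much of the baseline score PruMerge retains on each benchmark. The numbers are copied from the two Vicuna-7B rows above; the variable names and the "retained fraction" measure are only an illustrative convenience, not a metric reported in the paper.

```python
benchmarks = ["VQAv2", "SQA^I", "VQA^T", "POPE", "MME", "MMB"]
baseline = [78.5, 66.8, 58.2, 85.9, 1510.7, 64.3]  # LLaVA-1.5, Vicuna-7B
prumerge = [72.0, 68.5, 56.0, 76.3, 1350.3, 60.9]  # LLaVA-1.5 + PruMerge, Vicuna-7B

# Fraction of the baseline score that survives token reduction.
retained = {name: p / b for name, b, p in zip(benchmarks, baseline, prumerge)}

for name, r in retained.items():
    print(f"{name}: {100 * r:.1f}% of the baseline score")
```

A value above 100% means the reduced-token model scores higher than the baseline on that benchmark, as happens for the ScienceQA column in these rows.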

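Applied to LLaVA-1.5, PruMerge cuts the visual tokens to about 5.5% of the original (roughly 32 tokens on average), with performance that remains broadly comparable; the table above shows the per-benchmark gaps.
Across six benchmarks, LLaVA-PruMerge is competitive with the original LLaVA-1.5 and outperforms some baselines (such as BLIP-2 and InstructBLIP).
PruMerge+ widens the token selection with spatially uniform sampling, so it keeps more tokens than PruMerge (about 4x compression overall) and loses almost no performance.
An efficiency analysis shows substantially lower FLOPs and memory overhead; for example, with Vicuna-7B under INT4 quantization, both the prefill cost and the total cost drop markedly when PruMerge is used.
Fine-tuning further improves results, with better scores on tasks such as ScienceQA, TextVQA, POPE, and MME.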
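
The token budgets quoted above can be checked with a little arithmetic. The sketch below assumes a patch size of 14 for the vision encoder at 336-pixel input, which is not stated in this review, and it counts only the quadratic pairwise term of self-attention over the visual tokens; text tokens and the linear-cost layers are ignored, so the ratios are an upper bound on the real saving, not a measured speedup.

```python
def visual_tokens(resolution=336, patch=14):
    """Number of patch tokens for a square image (patch size is an assumption)."""
    side = resolution // patch
    return side * side

def attention_pairs(n_tokens):
    """Pairwise interactions in self-attention grow with the square of the length."""
    return n_tokens * n_tokens

full = visual_tokens()  # 24 * 24 patches
budgets = {
    "PruMerge (about 32 tokens)": 32,
    "PruMerge+ (about 4x fewer)": full // 4,
}

for name, kept in budgets.items():
    share = 100 * kept / full
    saving = attention_pairs(full) / attention_pairs(kept)
    print(f"{name}: {kept}/{full} tokens = {share:.1f}%, "
          f"pairwise term shrinks {saving:.0f}x")
```

Under these assumptions the full input has 576 visual tokens, so 32 retained tokens is a little under 6% of the original, in line with the roughly 5.5% figure quoted in the findings.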

This review was generated by AI and reviewed by a human editor.