QUICK REVIEW

[Paper Review] VL-Mamba: Exploring State Space Models for Multimodal Learning

Yanyuan Qiao, Zheng Yu|arXiv (Cornell University)|Mar 20, 2024

Speech and dialogue systems | Cited by 10

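One-Sentence Summary

VL-Mamba introduces a multimodal large language model based on state space models, uses a multimodal connector built on Vision Selective Scan, and achieves competitive results on eight benchmarks.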

ABSTRACT

Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the inherent attention mechanism in its Transformer structure requires quadratic complexity and results in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal large language model based on state space models, which have been shown to have great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the transformer-based backbone language model such as LLama or Vicuna with the pre-trained Mamba language model. Then, we empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning and the combinations of different vision encoders and variants of pretrained Mamba language models. The extensive experiments on diverse multimodal benchmarks with competitive performance show the effectiveness of our proposed VL-Mamba and demonstrate the great potential of applying state space models for multimodal learning tasks.

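Research Motivation and Objectives

Motivation: apply state space models (SSMs) to multimodal learning to avoid the quadratic computational cost of attention in Transformers.
Propose VL-Mamba, which replaces the Transformer-based backbone with a Mamba LLM and adds a multimodal connector (MMC) based on 2D vision selective scan.
Study how different vision encoders, LLM variants, and MMC architectures affect multimodal performance.
Demonstrate competitive results on standard multimodal benchmarks and provide ablation studies to understand the contribution of each component.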
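
The scaling argument behind this motivation can be illustrated with a toy example. The sketch below is not Mamba's selective scan (which makes its parameters depend on the input); it is a minimal scalar, time-invariant state space recurrence, with hypothetical parameter names `a`, `b`, and `c`, meant only to show that one pass over the sequence touches each token once, whereas self-attention scores every pair of tokens.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Toy scalar SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    Illustrative only; a, b, c are fixed here, while Mamba makes
    the corresponding parameters input-dependent.
    """
    h = 0.0
    ys = []
    for x in xs:  # one state update per token: linear in len(xs)
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pair_count(seq_len):
    """Number of query-key scores in full self-attention."""
    return seq_len * seq_len


def scan_step_count(seq_len):
    """Number of recurrent state updates in the scan above."""
    return seq_len


if __name__ == "__main__":
    print(ssm_scan([1.0, 0.0, 0.0]))
    for n in (256, 1024, 4096):
        print(n, scan_step_count(n), attention_pair_count(n))
```

Growing the sequence by 4x grows the scan's step count by 4x but the attention pair count by 16x, which is the gap the paper aims to exploit.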

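Proposed Method

Use a pre-trained Mamba LLM as the backbone language model in place of a Transformer-based LLM.
Use a Vision Transformer as the vision encoder to extract image patch features.
Introduce a MultiModal Connector (MMC) with a Vision Selective Scan (VSS) module to bridge 2D visual data and 1D sequence modeling.
Explore two 2D scan mechanisms, Bidirectional-Scan and Cross-Scan, to capture visual context efficiently.
Evaluate three MMC variants (MLP, VSS-MLP, VSS-L2) and two vision encoders (CLIP-ViT-L and SigLIP-SO) through extensive ablations.
Run experiments on eight multimodal benchmarks, comparing VL-Mamba with state-of-the-art MLLMs.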
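
The two scan mechanisms can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: patches are reduced to single floats, the per-direction sequence model is a hypothetical fixed-decay recurrence (`toy_causal_scan`) standing in for a selective SSM, and the four cross-scan directions (row-major and column-major, each forward and reversed) follow a common cross-scan formulation that may differ from the paper's exact traversal.

```python
def row_major_order(height, width):
    """Patch indices read left-to-right, top-to-bottom."""
    return [r * width + c for r in range(height) for c in range(width)]


def col_major_order(height, width):
    """Patch indices read top-to-bottom, left-to-right."""
    return [r * width + c for c in range(width) for r in range(height)]


def bidirectional_scan_orders(height, width):
    """Two traversals: forward and backward over the flattened grid."""
    forward = row_major_order(height, width)
    return [forward, forward[::-1]]


def cross_scan_orders(height, width):
    """Four traversals (assumed): rows and columns, each in both directions."""
    rows = row_major_order(height, width)
    cols = col_major_order(height, width)
    return [rows, rows[::-1], cols, cols[::-1]]


def toy_causal_scan(values, decay=0.5):
    """Hypothetical stand-in for an SSM: each output sees only earlier inputs."""
    state = 0.0
    out = []
    for v in values:
        state = decay * state + v
        out.append(state)
    return out


def vision_selective_scan(patches, orders):
    """Scan along every order, map outputs back to patch positions, and sum."""
    merged = [0.0] * len(patches)
    for order in orders:
        scanned = toy_causal_scan([patches[i] for i in order])
        for position, patch_index in enumerate(order):
            merged[patch_index] += scanned[position]
    return merged


if __name__ == "__main__":
    grid = [0.0, 0.0, 0.0, 1.0]  # 2x2 grid, signal only in the last patch
    print(vision_selective_scan(grid, [row_major_order(2, 2)]))
    print(vision_selective_scan(grid, bidirectional_scan_orders(2, 2)))
    print(vision_selective_scan(grid, cross_scan_orders(2, 2)))
```

With a single forward scan, the first patch never receives information from the last one; adding the reversed traversal lets every patch aggregate context from the whole grid, which is how a causal sequence model can be made to handle non-causal image data.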

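Experimental Results

Research Questions

RQ1: Can replacing the Transformer backbone with a Mamba LLM improve efficiency and scalability on multimodal tasks?
RQ2: How effective is an MMC based on 2D Vision Selective Scan at bridging non-causal visual data and causal state space modeling?
RQ3: How do different vision encoders, MMC architectures, and scan mechanisms affect results on multimodal benchmarks?
RQ4: Can VL-Mamba achieve performance competitive with some large MLLMs while using fewer parameters and less pre-training data?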

Key Findings

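PT and IT give the number of samples used for pre-training and instruction tuning, respectively. SQA-I is ScienceQA-IMG and VQA-T is TextVQA. A dash marks a score that is not reported.

Method	LLM	PT	IT	VQAv2	GQA	SQA-I	VQA-T	POPE	MME	MMBench	MMVet
BLIP-2	Vicuna-13B	129M	-	41.0	41.0	61.0	42.5	85.3	1293.8	-	22.4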
MiniGPT-4	Vicuna-7B	5M	5K	-	32.2	-	-	-	581.7	23.0	-
InstructBLIP	Vicuna-7B	129M	1.2M	-	49.2	60.5	50.1	-	-	36	26.2
InstructBLIP	Vicuna-13B	129M	1.2M	-	49.5	63.1	50.7	78.9	1212.8	-	25.6
Shikra	Vicuna-13B	600K	5.5M	77.4	-	-	-	-	-	58.8	-
Otter	LLaMA-7B	-	-	-	-	-	-	-	1292.3	48.3	24.6
mPLUG-Owl	LLaMA-7B	2.1M	102K	-	-	-	-	-	967.3	49.4	-
IDEFICS-9B	LLaMA-7B	353M	1M	50.9	38.4	-	25.9	-	-	48.2	-
IDEFICS-80B	LLaMA-65B	353M	1M	60.0	45.2	-	30.9	-	-	54.5	-
Qwen-VL	Qwen-7B	1.4B	50M	78.8	59.3	67.1	63.8	-	-	38.2	-
Qwen-VL-Chat	Qwen-7B	1.4B	50M	78.2	57.5	68.2	61.5	-	1487.5	60.6	-
LLaVA-1.5	Vicuna-7B	558K	665K	78.5	62.0	66.8	58.2	85.9	1510.7	64.3	30.5
LLaVA-1.5	Vicuna-13B	558K	665K	80.0	63.3	71.6	61.3	85.9	1531.3	67.7	35.4
LLaVA-Phi	Phi-2-2.7B	558K	665K	71.4	-	68.4	48.6	85.0	1335.1	59.8	28.9
MobileVLM-3B	MobileLLaMA-2.7B	558K	665K	-	59.0	61.2	47.5	84.9	1288.9	59.6	-
VL-Mamba	Mamba LLM-2.8B	558K	665K	76.6	56.2	65.4	48.9	84.4	1369.6	57.0	32.6

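VL-Mamba achieves competitive performance against other small MLLMs of similar scale and, on some benchmarks, even surpasses larger models; for example, its MMVet score of 32.6 exceeds that of LLaVA-1.5 with Vicuna-7B (30.5).
In the ablations, the VL-Mamba variant that pairs the SigLIP-SO vision encoder with the Mamba-2.8B-Slimpj LLM shows strong overall performance.
The VSS-L2 MMC architecture and the Bidirectional-Scan Mechanism (BSM) generally achieve strong results across the benchmarks.
VL-Mamba demonstrates that applying state space models to multimodal learning tasks is feasible and yields competitive results.
The ablation studies show that the language model variant, vision encoder, MMC design, and scan mechanism all have a notable effect on performance.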
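
To make the "competitive at similar scale" claim concrete, the comparison can be tallied directly from the table. The sketch below copies four rows from the results table as reported (with `None` for unreported scores) and lists the benchmarks on which VL-Mamba scores higher than each baseline. The helper name `benchmarks_won` is hypothetical, and the tally assumes higher is better on every benchmark.

```python
BENCHMARKS = ["VQAv2", "GQA", "SQA-I", "VQA-T", "POPE", "MME", "MMBench", "MMVet"]

# Scores copied from the results table; None marks an unreported value.
SCORES = {
    "LLaVA-1.5 (Vicuna-7B)": [78.5, 62.0, 66.8, 58.2, 85.9, 1510.7, 64.3, 30.5],
    "LLaVA-Phi (Phi-2-2.7B)": [71.4, None, 68.4, 48.6, 85.0, 1335.1, 59.8, 28.9],
    "MobileVLM-3B": [None, 59.0, 61.2, 47.5, 84.9, 1288.9, 59.6, None],
    "VL-Mamba": [76.6, 56.2, 65.4, 48.9, 84.4, 1369.6, 57.0, 32.6],
}


def benchmarks_won(model, baseline):
    """Benchmarks where `model` scores strictly higher than `baseline`.

    Benchmarks with a missing score on either side are skipped.
    """
    wins = []
    for name, ours, theirs in zip(BENCHMARKS, SCORES[model], SCORES[baseline]):
        if ours is not None and theirs is not None and ours > theirs:
            wins.append(name)
    return wins


if __name__ == "__main__":
    for baseline in SCORES:
        if baseline != "VL-Mamba":
            print(baseline, benchmarks_won("VL-Mamba", baseline))
```

Against the two models of similar size, VL-Mamba leads on roughly half of the benchmarks both sides report; against the larger LLaVA-1.5 (Vicuna-7B), it leads only on MMVet.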

This review was generated by AI and reviewed by a human editor.