QUICK REVIEW

[Paper Review] Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Han Zhao, M. Zhang|arXiv (Cornell University)|Mar 21, 2024
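Natural Language Processing Techniques | Cited by 6
One-Sentence Summary

Cobra builds a multimodal large language model by pairing the linear-time Mamba state space model with vision encoders. It keeps accuracy competitive while running inference 3–4× faster than Transformer baselines, with about 43% of the parameters of a larger model (LLaVA v1.5 7B).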

ABSTRACT

In recent years, the application of multimodal large language models (MLLM) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are composed of the well-known Transformer network, which has a less efficient quadratic computation complexity. To improve the efficiency of such basic models, we propose Cobra, a linear computational complexity MLLM. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due to Cobra's linear sequential modeling. (2) Interestingly, the results of closed-set challenging prediction benchmarks show that Cobra performs well in overcoming visual illusions and spatial relationship judgments. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all codes of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLM. Our project page is available at: https://sites.google.com/view/cobravlm.

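Research Motivation and Goals

  • Identify the efficiency limits that quadratic complexity imposes on Transformer-based multimodal large language models.
  • Propose the Cobra architecture, which uses a linear-time state space model (Mamba) for multimodal processing.
  • Study modality fusion schemes that integrate vision and language information effectively.
  • Demonstrate Cobra's competitive performance and higher speed on standard VLM benchmarks.
  • Show that the parameter count can be reduced while performance is maintained.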
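
To make the quadratic-versus-linear contrast concrete, here is a minimal sketch in plain Python. It is not Cobra's or Mamba's implementation: the function names are hypothetical, and the recurrence uses fixed scalar coefficients, whereas Mamba uses learned, input-dependent (selective) parameters over vector states. It only illustrates that causal attention scores a number of token pairs that grows quadratically with sequence length, while a state-space scan performs one constant-size state update per token.

```python
from typing import List


def attention_pair_count(seq_len: int) -> int:
    """Token pairs scored by causal self-attention.

    Each token attends to itself and every earlier token,
    so the count grows as O(seq_len ** 2).
    """
    return seq_len * (seq_len + 1) // 2


def scan_step_count(seq_len: int) -> int:
    """State updates performed by a recurrent scan: one per token, O(seq_len)."""
    return seq_len


def linear_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Toy scalar state-space recurrence (an illustrative simplification).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    The state h has a fixed size no matter how long the sequence is.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, attention_pair_count(n), scan_step_count(n))
```

Doubling the sequence length roughly quadruples the attention pair count but only doubles the number of scan steps, which is the efficiency argument the goals above rest on.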

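Proposed Method

  • Use a vision encoder stack (DINOv2 + SigLIP) to extract visual representations from images.
  • Introduce a projection module that aligns visual tokens to the Mamba token space (an MLP or an alternative).
  • Adopt Mamba as the backbone, with 64 blocks, to process the concatenated visual and text embeddings autoregressively.
  • Fuse the vision and language modalities inside Mamba, exploring different fusion schemes to optimize the multimodal representation.
  • Fine-tune end to end on multiple datasets, training the entire LLM backbone and the projector for two epochs on roughly 1.2M image-text samples.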
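
The data flow in the list above can be traced with a small sketch in plain Python. Everything here is illustrative: the function names are hypothetical, a single linear map stands in for the projector, the backbone blocks are placeholders rather than real Mamba blocks, and two details are assumptions not stated in this summary (the two encoders' features are joined per patch along the channel axis, and visual tokens are placed before text tokens).

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]


def fuse_vision_features(dino: Sequence, siglip: Sequence) -> Sequence:
    """Join per-patch features from two encoders along the channel axis (assumed)."""
    if len(dino) != len(siglip):
        raise ValueError("both encoders must yield the same number of patch tokens")
    return [d + s for d, s in zip(dino, siglip)]


def project(tokens: Sequence, weight: List[Vector]) -> Sequence:
    """Map visual tokens into the language model's embedding space.

    `weight` has shape (d_model, d_vision). A single linear map is a
    stand-in for the MLP projector mentioned in the text.
    """
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def build_sequence(visual: Sequence, text: Sequence) -> Sequence:
    """Concatenate visual and text embeddings along the sequence axis (visual first, assumed)."""
    return visual + text


def run_backbone(sequence: Sequence, blocks: List[Callable[[Sequence], Sequence]]) -> Sequence:
    """Apply the stacked backbone blocks in order."""
    for block in blocks:
        sequence = block(sequence)
    return sequence


if __name__ == "__main__":
    dino = [[1.0, 2.0], [3.0, 4.0]]        # 2 patches, 2 channels
    siglip = [[5.0], [6.0]]                # 2 patches, 1 channel
    weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # d_model = 2, d_vision = 3
    text = [[0.5, 0.5]]                    # 1 text token embedding
    visual = project(fuse_vision_features(dino, siglip), weight)
    blocks = [lambda seq: seq] * 64        # identity placeholders for the 64 blocks
    print(run_backbone(build_sequence(visual, text), blocks))
```

The point of the sketch is the shape bookkeeping: channel-wise fusion widens each patch token, the projector brings it to the backbone's width, and only then are visual and text tokens joined into one sequence.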
Figure 1 : Illustration of tokens per second and times in our proposed Cobra and baselines.

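Experimental Results

Research Questions

  • RQ1: When paired with vision encoders, can a linear-time state space model (Mamba) effectively support multimodal large language modeling?
  • RQ2: Which vision encoders and projection strategies best preserve visual information so that Cobra can reason accurately across modalities?
  • RQ3: How does Cobra perform on open-ended VQA and on closed-set spatial and hallucination benchmarks relative to Transformer baselines with a similar parameter budget?
  • RQ4: How much do MLLMs with a state space backbone gain in inference speed and memory use over Transformer baselines?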

Key Findings

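Model        | LLM             | VQA_v2 | GQA  | VizWiz | VQA_T | VSR  | POPE
OpenFlamingo | MPT-7B          | 52.7   | -    | 27.5   | 33.6  | -    | -
BLIP-2       | Vicuna-13B      | -      | 41.0 | 19.6   | 42.5  | 50.9 | -
MiniGPT-4    | Vicuna-7B       | 32.2   | -    | -      | -     | -    | -
InstructBLIP | Vicuna-7B       | -      | 49.2 | 34.5   | 50.1  | 54.3 | -
InstructBLIP | Vicuna-13B      | -      | 49.5 | 33.4   | 50.7  | 52.1 | -
Shikra       | Vicuna-13B      | 77.4   | -    | -      | -     | -    | -
IDEFICS      | LLaMA-7B        | 50.9   | -    | 35.5   | 25.9  | -    | -
IDEFICS      | LLaMA-75B       | 60.0   | -    | 36.0   | 30.9  | -    | -
Qwen-VL      | Qwen-7B         | 78.2   | 59.3 | 35.2   | 63.8  | -    | -
LLaVA v1.5   | Vicuna-7B       | 78.5   | 62.0 | 50.0   | 58.2  | -    | 85.9
Prism        | LLaMA-7B        | 81.0   | 65.3 | 52.8   | 59.7  | 59.6 | 88.1
ShareGPT4V   | Vicuna-7B       | 80.6   | 57.2 | -      | -     | -    | -
MoE-LLaVA    | StableLM-1.6B   | 76.7   | 60.3 | 36.2   | 50.1  | -    | 85.7
MoE-LLaVA    | Phi2-2.7B       | 77.6   | 61.4 | 43.9   | 51.4  | -    | 86.3
Llava-Phi    | Phi2-2.7B       | 71.4   | -    | 35.9   | 48.6  | -    | 85.0
MobileVLM v2 | MobileLLaMA-2.7B | -     | 61.1 | -      | 57.5  | -    | 84.7
TinyLLaVA    | Phi2-2.7B       | 79.9   | 62.0 | -      | 59.1  | -    | 86.4
Cobra (ours) | Mamba-2.8B      | 75.9   | 58.5 | 52.0   | 46.0  | 63.6 | 88.0
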
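  • With linear sequence modeling, Cobra achieves competitive performance against efficient state-of-the-art methods such as LLaVA-Phi, TinyLLaVA, and MobileVLM v2.
  • Cobra performs well on closed-set tasks that require spatial relationship judgments and is robust against visual hallucination.
  • With about 43% of the parameters of LLaVA v1.5 7B, Cobra reaches comparable performance on several benchmarks, which highlights its efficiency advantage.
  • Cobra's inference is markedly faster (for example, 3×–4× faster than MobileVLM v2 and TinyLLaVA at a similar size).
  • Ablation studies show that combining DINOv2 with SigLIP improves results, and that fine-tuning from a chat-tuned Mamba model yields better instruction following.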
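
As a quick way to read "competitive" against the results table above, the sketch below computes, per benchmark, the gap between Cobra and the best-reported score among the sub-3B baselines. The scores are copied from the table; the grouping into "small baselines" and all names in the code are choices made for this illustration, not part of the paper.

```python
# Scores copied from the results table above; None marks a missing entry ("-").
BENCHMARKS = ["VQA_v2", "GQA", "VizWiz", "VQA_T", "VSR", "POPE"]

# Baselines with fewer than 3B language-model parameters (grouping chosen for this sketch).
SMALL_BASELINES = {
    "MoE-LLaVA (StableLM-1.6B)": [76.7, 60.3, 36.2, 50.1, None, 85.7],
    "MoE-LLaVA (Phi2-2.7B)": [77.6, 61.4, 43.9, 51.4, None, 86.3],
    "Llava-Phi (Phi2-2.7B)": [71.4, None, 35.9, 48.6, None, 85.0],
    "MobileVLM v2 (MobileLLaMA-2.7B)": [None, 61.1, None, 57.5, None, 84.7],
    "TinyLLaVA (Phi2-2.7B)": [79.9, 62.0, None, 59.1, None, 86.4],
}

COBRA = [75.9, 58.5, 52.0, 46.0, 63.6, 88.0]


def gaps_to_best_baseline():
    """Return Cobra's score minus the best small-baseline score for each benchmark.

    A benchmark with no reported baseline score maps to None.
    """
    gaps = {}
    for i, name in enumerate(BENCHMARKS):
        reported = [scores[i] for scores in SMALL_BASELINES.values() if scores[i] is not None]
        gaps[name] = round(COBRA[i] - max(reported), 1) if reported else None
    return gaps


if __name__ == "__main__":
    for name, gap in gaps_to_best_baseline().items():
        print(name, "n/a" if gap is None else f"{gap:+.1f}")
```

On these numbers Cobra leads the small baselines on VizWiz and POPE, trails on VQA_v2, GQA, and most clearly on VQA_T, and is the only small model in the table with a VSR score.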
Figure 2 : Detailed architecture of Cobra (right) that takes Mamba as the backbone consisting of identical Mamba blocks (left). The parameters of vision encoders are frozen during training.


This review was generated by AI and reviewed by human editors.