QUICK REVIEW

[Paper Review] TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Zhengqing Yuan, Zhaoxu Li|arXiv (Cornell University)|Dec 28, 2023

Multimodal Machine Learning Applications | Cited by 7

ONE-SENTENCE SUMMARY

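TinyGPT-V is a parameter-efficient multimodal LLM built on the 2.8B-parameter Phi-2. By combining BLIP-2/CLIP vision modules with a lightweight training strategy, it achieves competitive performance on vision-language tasks while training on a 24GB GPU and running inference on 8GB devices.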

ABSTRACT

In recent years, multimodal large language models (MLLMs) such as GPT-4V have demonstrated remarkable advancements, excelling in a variety of vision-language tasks. Despite their prowess, the closed-source nature and computational demands of such models limit their accessibility and applicability. This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks, including image captioning (IC) and visual question answering (VQA). Leveraging a compact yet powerful architecture, TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, utilizing a unique mapping module for visual and linguistic information fusion. With a training regimen optimized for small backbones and employing a diverse dataset amalgam, TinyGPT-V requires significantly lower computational resources (24GB for training and as little as 8GB for inference) without compromising on performance. Our experiments demonstrate that TinyGPT-V, with its 2.8-billion-parameter language model, achieves comparable results in VQA and image inference tasks to its larger counterparts while being uniquely suited for deployment on resource-constrained devices through innovative quantization techniques. This work not only paves the way for more accessible and efficient MLLMs but also underscores the potential of smaller, optimized models in bridging the gap between high performance and computational efficiency in real-world applications. Additionally, this paper introduces a new approach to multimodal large language models using smaller backbones. Our code and training weights are available in the supplementary material.

RESEARCH MOTIVATION AND GOALS

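Advance the development of cost-effective, efficient multimodal LLMs that can rival much larger models.
Propose TinyGPT-V, a small-backbone MLLM that leverages Phi-2 and pre-trained vision modules.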
Demonstrate training strategies and normalization techniques that stabilize learning in small LLMs for multimodal tasks.
Showcase the model's performance across diverse vision-language benchmarks despite limited parameters.

PROPOSED METHOD

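An architecture that fuses visual-encoder projections (Q-Former) with the 2.8B Phi-2 language backbone.
Frozen vision modules (BLIP-2 or CLIP), with only the projection layers and LoRA trained for efficiency.
LLaMA-2-style post-normalization and input normalization, RMS normalization after multi-head attention (MHA), and Query-Key normalization, all introduced to stabilize training.
A four-stage training pipeline: warm-up, pre-training, instruction fine-tuning, and multi-task learning.
A multi-task instruction template with six task identifiers that unifies diverse vision-language tasks.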
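
To make the normalization items above more concrete, here is a minimal pure-Python sketch of RMS normalization and of attention weights computed with normalized queries and keys. It is an illustration only, not the authors' implementation: the function names, the `eps` value, and the choice of RMS normalization for the query and key vectors are all assumptions, and the paper's exact normalizer and placement may differ.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Rescale vector x so its root-mean-square is about 1, then apply a per-element gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # a learned parameter in a real model; fixed here
    return [g * v / rms for g, v in zip(gain, x)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def qk_norm_attention_weights(query, keys):
    """Attention weights for one query over several keys.

    Assumption: q and k are RMS-normalized before the scaled dot product,
    which bounds every logit by sqrt(d) no matter how large the inputs are.
    """
    d = len(query)
    q = rms_norm(query)
    logits = []
    for k in keys:
        k_n = rms_norm(k)
        logits.append(sum(a * b for a, b in zip(q, k_n)) / math.sqrt(d))
    return softmax(logits)

keys = [[1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
small = qk_norm_attention_weights([1.0, 2.0, 3.0, 4.0], keys)
large = qk_norm_attention_weights([1000.0, 2000.0, 3000.0, 4000.0], keys)
print(small)
print(large)  # nearly identical to `small`: the query's scale no longer matters
```

Because both vectors are rescaled before the dot product, a query that is 1000 times larger produces almost the same attention weights. That kind of bounded logit is one plausible reason such normalization helps a small backbone train stably.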

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

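RQ1: Can a small 2.8B LLM (Phi-2), combined with pre-trained vision modules, achieve competitive MLLM performance?
RQ2: Which training strategies (normalization, LoRA, quantization) are needed to stabilize multimodal learning on a small backbone?
RQ3: How does TinyGPT-V perform on standard VQA, grounding, and referring tasks compared with larger open-source MLLMs?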

KEY FINDINGS

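TinyGPT-V (2.8B parameters) achieves competitive results on multiple vision-language benchmarks despite being far smaller than 13B+ models.
On the zero-shot VSR test, TinyGPT-V scores 53.2%, the highest among the reported 2.8B to 13B baselines.
On the IconVQ and HM tasks, TinyGPT-V reaches 43.3% and 53.2% respectively, competitive with larger models.
Staged training combined with normalization (RMS Norm, QK Norm) and LoRA is critical for preventing vanishing gradients and reaching low loss at every stage.
Thanks to its efficient architecture and quantization, TinyGPT-V can be trained on a single 24GB GPU and deployed on 8GB devices.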
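
As a rough sanity check on the 8GB deployment claim, the back-of-the-envelope calculation below estimates the memory needed for the language-model weights alone at a few precisions. It is a sketch under stated assumptions: the 2.8B parameter count is the figure quoted in this review, the bit widths are illustrative rather than the paper's settings, and the estimate ignores the vision encoder, activations, and any runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PHI2_PARAMS = 2.8e9  # language-model size as quoted in this review

# Illustrative precisions; these are not taken from the paper.
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8)]:
    print(f"{label}: {weight_memory_gb(PHI2_PARAMS, bits):.1f} GB")
# fp32: 11.2 GB
# fp16: 5.6 GB
# int8: 2.8 GB
```

Under these assumptions, half-precision weights already fit below 8GB, and an 8-bit representation leaves further room for the vision modules and activations. This is consistent with the reported role of quantization in on-device inference.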

This review was generated by AI and reviewed by a human editor.