QUICK REVIEW

[Paper Review] TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Zhengqing Yuan, Zhaoxu Li|arXiv (Cornell University)|Dec 28, 2023

Multimodal Machine Learning Applications|Cited by 7

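One-line summary

TinyGPT-V is a parameter-efficient multimodal LLM built on Phi-2, with 2.8B parameters. Using BLIP-2/CLIP vision modules and a lightweight training strategy, it reaches competitive performance on vision-language tasks while needing only an 8 GB device for inference and a 24 GB GPU for training.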

ABSTRACT

In recent years, multimodal large language models (MLLMs) such as GPT-4V have demonstrated remarkable advancements, excelling in a variety of vision-language tasks. Despite their prowess, the closed-source nature and computational demands of such models limit their accessibility and applicability. This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks, including image captioning (IC) and visual question answering (VQA). Leveraging a compact yet powerful architecture, TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, utilizing a unique mapping module for visual and linguistic information fusion. With a training regimen optimized for small backbones and employing a diverse dataset amalgam, TinyGPT-V requires significantly lower computational resources (24GB for training and as little as 8GB for inference) without compromising on performance. Our experiments demonstrate that TinyGPT-V, with its 2.8-billion-parameter language model, achieves comparable results in VQA and image inference tasks to its larger counterparts while being uniquely suited for deployment on resource-constrained devices through innovative quantization techniques. This work not only paves the way for more accessible and efficient MLLMs but also underscores the potential of smaller, optimized models in bridging the gap between high performance and computational efficiency in real-world applications. Additionally, this paper introduces a new approach to multimodal large language models using smaller backbones. Our code and training weights are available in the supplementary material.

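Motivation and objectives

Develop a low-cost, efficient multimodal LLM and show that it can compete with larger models.
Propose TinyGPT-V as a small-backbone MLLM that combines Phi-2 with pre-trained vision modules.
Present training strategies and normalization techniques that stabilize multimodal training on a small LLM.
Demonstrate model performance across a range of vision-language benchmarks despite the limited parameter count.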

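Proposed method

An architecture that connects the visual encoder, through projection layers (Q-Former), to the 2.8B Phi-2 language backbone.
The vision modules (BLIP-2 or CLIP) are frozen; only the projection layers and LoRA weights are trained, which keeps training efficient.
To stabilize training, the model adds LLaMA-2-style post-norm and input norm, an RMS norm after multi-head attention (MHA), and Query-Key Normalization.
A four-stage training pipeline: warm-up, pre-training, instruction fine-tuning, and multi-task learning.
A multi-task instruction template with six task identifiers unifies the different vision-language tasks.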
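
The two normalization ideas named above can be illustrated with a minimal, dependency-free sketch. This is a generic textbook form, not the paper's implementation: the function names, the `eps` value, and the choice to reuse the RMS norm as the query/key normalizer are all assumptions made for illustration, since this review does not state which normalizer the authors apply to queries and keys.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Rescale a vector so its root-mean-square is about 1, then apply an optional gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if weight is None:
        weight = [1.0] * len(x)  # identity gain (assumption: no learned scale)
    return [w * v / rms for v, w in zip(x, weight)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention(queries, keys, values):
    """Single-head attention where queries and keys are normalized before the dot product.

    Normalizing Q and K bounds the attention logits, which is the usual
    motivation for Query-Key Normalization.
    """
    d = len(queries[0])
    qn = [rms_norm(q) for q in queries]
    kn = [rms_norm(k) for k in keys]
    out = []
    for q in qn:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in kn]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

Because each normalized query and key has a norm of roughly sqrt(d), every logit in this sketch stays within about ±sqrt(d) regardless of how large the raw activations grow, so the softmax cannot saturate purely because of activation scale.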

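Experimental results

Research questions

RQ1 Can a small 2.8B LLM (Phi-2) combined with pre-trained vision modules achieve competitive MLLM performance?
RQ2 Which training strategies (normalization, LoRA, quantization) are needed to stabilize multimodal training on a small backbone?
RQ3 How does TinyGPT-V perform on standard VQA, grounding, and referring tasks compared with larger open-source MLLMs?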

Key findings

Method	Parameters	Grounding	GQA	VSR	IconVQ	VizWiz	HM
Flamingo	9B	✗	-	31.8	-	28.8	57.0
BLIP-2	13B	✗	41.0	50.9	40.6	19.6	53.7
LLaVA	13B	✗	41.3	51.2	43.0	-	-
Shikra	13B	✓	-	-	-	-	-
InstructBLIP	13B	✗	49.5	52.1	44.8	33.4	57.5
MiniGPT-4	13B	✗	30.8	41.6	37.6	-	-
TinyGPT-V	2.8B	✓	33.6	53.2	43.3	24.8	53.2
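
To make the comparison in the table easier to read, the short sketch below ranks TinyGPT-V against the models that report a score on each benchmark. The numbers are copied from the table; the data layout and the `rank_of` helper are hypothetical names introduced only for this illustration, and entries shown as "-" are treated as not reported.

```python
# Scores copied from the table above; None marks entries reported as "-".
BENCHMARKS = ["GQA", "VSR", "IconVQ", "VizWiz", "HM"]
SCORES = {
    # model: (parameters in billions, scores in BENCHMARKS order)
    "Flamingo": (9.0, [None, 31.8, None, 28.8, 57.0]),
    "BLIP-2": (13.0, [41.0, 50.9, 40.6, 19.6, 53.7]),
    "LLaVA": (13.0, [41.3, 51.2, 43.0, None, None]),
    "Shikra": (13.0, [None, None, None, None, None]),
    "InstructBLIP": (13.0, [49.5, 52.1, 44.8, 33.4, 57.5]),
    "MiniGPT-4": (13.0, [30.8, 41.6, 37.6, None, None]),
    "TinyGPT-V": (2.8, [33.6, 53.2, 43.3, 24.8, 53.2]),
}


def rank_of(model, benchmark):
    """Return (rank, number of models reporting a score); rank 1 is the best score."""
    col = BENCHMARKS.index(benchmark)
    reported = {name: s[col] for name, (_, s) in SCORES.items() if s[col] is not None}
    ordered = sorted(reported, key=reported.get, reverse=True)
    return ordered.index(model) + 1, len(ordered)


for b in BENCHMARKS:
    rank, total = rank_of("TinyGPT-V", b)
    print(f"{b}: rank {rank} of {total}")
```

Read this way, TinyGPT-V ranks first of six on VSR and second of five on IconVQ, while it places fourth of five on GQA, third of four on VizWiz, and fourth of four on HM.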

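TinyGPT-V (2.8B parameters) achieves competitive results on several vision-language benchmarks despite being much smaller than the 13B-scale models it is compared with.
On zero-shot VSR, TinyGPT-V scores 53.2%, the highest among the reported baselines, which range from 2.8B to 13B parameters.
On IconVQ and HM it reaches 43.3% and 53.2% respectively, within 1.5 and 4.3 points of the best reported scores from larger models.
Staged training that combines normalization (RMS Norm, QK Norm) with LoRA is important for preventing vanishing gradients and for reaching a low loss at each stage.
TinyGPT-V can be trained on a single 24 GB GPU, and its efficient architecture and quantization allow deployment on 8 GB devices.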
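
A rough back-of-the-envelope calculation shows why a 2.8B-parameter language model is plausible on an 8 GB device. The sketch below counts only the memory for the language-model weights; the bit widths are illustrative assumptions rather than the paper's stated configuration, and the vision encoder, activations, and any optimizer state are deliberately left out.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PHI2_PARAMS = 2.8e9  # language-model size quoted in this review

# Bit widths below are illustrative assumptions, not the paper's settings.
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8)]:
    print(f"{label}: {weight_memory_gb(PHI2_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take about 11.2 GB in fp32, 5.6 GB in fp16, and 2.8 GB in int8, so quantized weights leave headroom on an 8 GB device for the remaining components.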

This review was generated by AI and checked by a human editor.