QUICK REVIEW

[Paper Review] Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Gen Luo, Yiyi Zhou|arXiv (Cornell University)|May 24, 2023

Multimodal Machine Learning Applications | Cited by 45

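One-sentence summary

This paper proposes Mixture-of-Modality Adaptation (MMA), which uses lightweight adapters to adapt large language models (LLMs) to vision-language tasks efficiently and enables end-to-end training with only a small number of trainable parameters. It introduces LaVIN, a vision-language instruction-tuned model built on LLaMA, which achieves competitive performance at a greatly reduced training cost.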

ABSTRACT

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive parameters, but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions without compromising their ability of natural language understanding. To validate MMA, we apply it to a recent LLM called LLaMA and term this formed large vision-language instructed model as LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and the superior training efficiency of LaVIN than existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely cheap, e.g., only 1.4 training hours with 3.8M trainable parameters, greatly confirming the effectiveness of MMA. Our project is released at https://luogen1996.github.io/lavin.

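Motivation and objectives

Motivate affordable vision-language (VL) adaptation of large language models (LLMs) without a separate large-scale VL pre-training stage.
Introduce MMA, which connects the image encoder and the LLM through lightweight adapters and a dynamic routing mechanism.
Achieve end-to-end training with a small parameter footprint, and validate it on multimodal science QA and dialogue tasks.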

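Proposed method

Introduce the Mixture-of-Modality Adapter (MM-Adapter), which uses a modality token and a softmax-based router to route dynamically between single-modal and multimodal adaptation paths.
Define Mixture-of-Modality Training (MMT), which keeps the backbone LLM and image encoder frozen and fine-tunes only the adapters with an end-to-end objective.
Apply MMA to LLaMA to build LaVIN, using CLIP-ViT as the image encoder and six [cls] tokens from the ViT as visual features.
Adopt a visual neck with a small parameter footprint (e.g., 3-5M trainable parameters) and train on a mixture of text-only and text-image instructions.
Jointly optimize the multimodal LLM through end-to-end training, which enables automatic switching between text-only and image-text instructions.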
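The routing idea described for the MM-Adapter can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LaVIN's actual implementation: the class names (`BottleneckAdapter`, `MixtureOfModalityAdapter`), the ReLU nonlinearity, the temperature and scale values, and the tiny dimensions are all hypothetical choices made for readability. It shows a modality index selecting a modality embedding, a softmax router turning that embedding into two weights, and those weights mixing two adapter paths before a residual add.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Down-project, apply a nonlinearity, up-project (hypothetical design)."""

    def __init__(self, dim, hidden, rng):
        self.down = random_matrix(hidden, dim, rng)
        self.up = random_matrix(dim, hidden, rng)

    def __call__(self, token):
        # ReLU is an assumption made for this sketch.
        hidden = [max(0.0, h) for h in matvec(self.down, token)]
        return matvec(self.up, hidden)


class MixtureOfModalityAdapter:
    """Mixes two adapter paths with weights routed from a modality token."""

    def __init__(self, dim, hidden, rng, temperature=10.0, scale=1.0):
        self.paths = [BottleneckAdapter(dim, hidden, rng) for _ in range(2)]
        # One embedding per modality: 0 = text only, 1 = image + text.
        self.modality_embed = random_matrix(2, dim, rng)
        # Router maps the modality embedding to one logit per adapter path.
        self.router = random_matrix(2, dim, rng)
        self.temperature = temperature  # illustrative value
        self.scale = scale              # illustrative value

    def routing_weights(self, modality):
        logits = matvec(self.router, self.modality_embed[modality])
        return softmax(logits, self.temperature)

    def __call__(self, tokens, modality):
        weights = self.routing_weights(modality)
        outputs = []
        for token in tokens:
            deltas = [path(token) for path in self.paths]
            mixed = [
                sum(w * delta[i] for w, delta in zip(weights, deltas))
                for i in range(len(token))
            ]
            # Residual connection: the input passes through unchanged,
            # plus the scaled, routed adapter output.
            outputs.append([t + self.scale * m for t, m in zip(token, mixed)])
        return outputs


rng = random.Random(0)
adapter = MixtureOfModalityAdapter(dim=8, hidden=2, rng=rng)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
text_out = adapter(tokens, modality=0)        # text-only instruction
image_text_out = adapter(tokens, modality=1)  # image-text instruction
```

In a setup like the one described, only adapter-side weights (here `paths`, `modality_embed`, and `router`) would be trainable while the backbone stays frozen, which is why the trainable parameter count can stay small.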

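Experimental results

Research questions

RQ1: Can MMA achieve competitive VL instruction-tuning performance while substantially reducing training cost and the number of trainable parameters?
RQ2: Can LaVIN acquire VL understanding without VL pre-training while retaining its NLP ability?
RQ3: How does MMA switch automatically between text-only and image-text instructions at inference time?
RQ4: How do adapter size, image-encoder strength, and LLM scale affect performance on ScienceQA and multimodal dialogue?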

Key findings

Method	#Trainable params	NAT	SOC	LAN	TXT	IMG	NO	G1-6	G7-12	Avg
LaVIN-7B	3.8M	89.25	94.94	85.24	88.51	87.46	88.08	90.16	88.07	89.41
LaVIN-13B	5.4M	90.32	94.38	87.73	89.44	87.65	90.31	91.19	89.26	90.50
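To make the comparison between the two rows easier to read, the short script below computes the per-column gain of LaVIN-13B over LaVIN-7B. The numbers are copied from the table; the variable names are illustrative. The column labels follow the usual ScienceQA breakdown (subject: NAT, SOC, LAN; context: TXT, IMG, NO; grade: G1-6, G7-12).

```python
COLUMNS = ["NAT", "SOC", "LAN", "TXT", "IMG", "NO", "G1-6", "G7-12", "Avg"]

# Accuracy values copied from the table above.
SCORES = {
    "LaVIN-7B": [89.25, 94.94, 85.24, 88.51, 87.46, 88.08, 90.16, 88.07, 89.41],
    "LaVIN-13B": [90.32, 94.38, 87.73, 89.44, 87.65, 90.31, 91.19, 89.26, 90.50],
}

# Gain of the 13B model over the 7B model, rounded to the table's precision.
gains = {
    column: round(large - small, 2)
    for column, small, large in zip(COLUMNS, SCORES["LaVIN-7B"], SCORES["LaVIN-13B"])
}

for column, gain in gains.items():
    print(f"{column:>6}: {gain:+.2f}")
```

The larger model improves the average by about 1.09 points, gains the most on LAN, and is slightly lower on SOC.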

LaVIN with MMA achieves results competitive with state-of-the-art multimodal LLMs while substantially reducing training time and storage (e.g., 3.8M trainable parameters and 1.4 training hours on ScienceQA).
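LaVIN-13B reaches about 90.83% accuracy on the ScienceQA test set with a 5.4M trainable-parameter budget (the table above lists a 90.50 average for the same model), outperforming several parameter-efficient baselines.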
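MMT yields the largest gain in the ablation study, improving average accuracy by up to +4.69 when combined with a stronger image encoder.
The gains from jointly optimizing the image encoder and the LLM, and from using a stronger image encoder (ViT-L/14), support the benefit of end-to-end VL adaptation.
On COCO captioning, LaVIN achieves a competitive CIDEr score despite using less pre-training data and updating fewer parameters, at a much lower training cost than BLIP-2 and LLaVA.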

This review was generated by AI and checked by a human editor.