QUICK REVIEW

[Paper Review] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han|arXiv (Cornell University)|Apr 28, 2023

Multimodal Machine Learning Applications|Cited by 117

Summary in Brief

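LLaMA-Adapter V2 extends LLaMA-Adapter by combining bias tuning, early fusion, and joint training over disjoint parameter groups, enabling open-ended visual instruction following from only a limited amount of image-text and instruction data. It can optionally integrate expert vision systems at inference time. With about 14M trainable parameters (on the order of 0.2% of a 7B-parameter LLaMA), it achieves strong multimodal and language instruction-following performance.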
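As a rough sanity check on the parameter budget, the sketch below compares 14M trainable parameters with the smallest and largest backbone sizes mentioned in this review. The backbone sizes are nominal labels (7B and 65B taken at face value), which is an assumption; exact parameter counts differ slightly.

```python
# Rough check: share of trainable parameters relative to a frozen backbone.
# Backbone sizes are nominal labels (assumption), not exact parameter counts.
TRAINABLE_PARAMS = 14_000_000

NOMINAL_BACKBONES = {"7B": 7_000_000_000, "65B": 65_000_000_000}


def trainable_fraction(trainable: int, total: int) -> float:
    """Return trainable parameters as a percentage of the backbone size."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * trainable / total


if __name__ == "__main__":
    for name, size in NOMINAL_BACKBONES.items():
        pct = trainable_fraction(TRAINABLE_PARAMS, size)
        print(f"LLaMA {name}: {pct:.3f}% trainable")
```

Under these nominal sizes the share is 0.2% for a 7B backbone and about 0.02% for a 65B backbone.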

ABSTRACT

How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
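The "bias and scale" unlocking mentioned in the abstract can be pictured as a frozen linear layer wrapped with a learnable scale and bias. The following is a minimal pure-Python sketch, assuming the form y = s * (W x + b) with the bias initialized to zero and the scale to one, so that the wrapped layer initially behaves exactly like the frozen one. The class name `BiasScaleLinear` and its methods are illustrative assumptions, not code from the authors' repository.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


class BiasScaleLinear:
    """Frozen weight matrix plus a trainable per-output bias and scale.

    Hypothetical illustration: computes y = scale * (W @ x + bias).
    """

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight                  # frozen, never updated
        out_features = len(weight)
        self.bias = [0.0] * out_features      # trainable, starts at 0
        self.scale = [1.0] * out_features     # trainable, starts at 1

    def forward(self, x: Vector) -> Vector:
        out = []
        for row, b, s in zip(self.weight, self.bias, self.scale):
            linear = sum(w * xi for w, xi in zip(row, x))
            out.append(s * (linear + b))
        return out

    def trainable_parameter_count(self) -> int:
        # Only the bias and scale vectors are counted; the weight stays frozen.
        return len(self.bias) + len(self.scale)


if __name__ == "__main__":
    layer = BiasScaleLinear([[1.0, 2.0], [0.0, -1.0]])
    x = [3.0, 4.0]
    print(layer.forward(x))   # same as the frozen layer at initialization
    layer.bias[0] = 0.5
    layer.scale[0] = 2.0
    print(layer.forward(x))   # first output is now shifted and rescaled
```

Each wrapped layer adds only two values per output feature, which is why this kind of tuning contributes few trainable parameters compared with the frozen weights.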

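Motivation and Objectives

Make it easier to build instruction-following visual models without large-scale multimodal data.
Introduce a parameter-efficient method for fusing visual information into a frozen LLM.
Propose a joint training scheme that separates image-text alignment from language instruction learning.
Enable integration with external expert vision systems to improve visual understanding.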

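Proposed Method

Bias tuning of linear layers: unfreeze the normalization layers and add a trainable bias and scale to every linear layer.
Joint training with disjoint parameter groups: train the visual projection and the early zero-initialized attention on image-text caption data, and train the late adaptation prompts, gating, and the additional LLaMA parameters (norm, bias, and scale) on instruction data.
Early fusion of visual knowledge: feed visual tokens only into the early LLM layers instead of adding them to the adaptation prompts across layers.
Expert models at inference: incorporate captioning, OCR, or detection systems to improve image understanding without additional training.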
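Training data: 52K single-turn language instruction examples and 567K image-text caption examples (COCO), plus 80K conversation examples; LLaMA backbones from 7B to 65B are used.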
Modest parameter footprint: about 14M trainable parameters, on the order of 0.2% of a 7B-parameter LLaMA.
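The combination of early fusion and disjoint parameter groups described in the list above can be sketched as a bookkeeping problem: each trainable parameter belongs to exactly one group, and a batch only updates the group that matches its data source. All names below (`visual_projection`, `layerN.visual_gate`, `layerN.adapt_prompt`, and so on), the layer counts, and the plain SGD update are hypothetical placeholders for illustration, not identifiers or settings from the authors' code.

```python
from typing import Dict, Set

NUM_LAYERS = 8          # hypothetical small stack for illustration
NUM_EARLY_LAYERS = 2    # hypothetical: layers that receive visual tokens


def build_parameter_groups(num_layers: int, num_early: int) -> Dict[str, Set[str]]:
    """Assign each trainable parameter name to exactly one data source."""
    caption_group = {"visual_projection"}
    instruction_group = {"norm", "linear_bias", "linear_scale"}
    for layer in range(num_layers):
        if layer < num_early:
            # Early layers: visual tokens enter via zero-initialized attention.
            caption_group.add(f"layer{layer}.visual_gate")
        else:
            # Late layers: language-only adaptation prompts and their gates.
            instruction_group.add(f"layer{layer}.adapt_prompt")
            instruction_group.add(f"layer{layer}.prompt_gate")
    return {"caption": caption_group, "instruction": instruction_group}


def sgd_step(params: Dict[str, float], grads: Dict[str, float],
             trainable: Set[str], lr: float = 0.1) -> Dict[str, float]:
    """Update only parameters in the active group; leave the rest untouched."""
    return {name: value - lr * grads[name] if name in trainable else value
            for name, value in params.items()}


if __name__ == "__main__":
    groups = build_parameter_groups(NUM_LAYERS, NUM_EARLY_LAYERS)
    assert groups["caption"].isdisjoint(groups["instruction"])
    params = {name: 1.0 for group in groups.values() for name in group}
    grads = {name: 1.0 for name in params}
    # A caption batch touches only the caption group.
    params = sgd_step(params, grads, groups["caption"])
    changed = sorted(name for name, value in params.items() if value != 1.0)
    print(changed)
```

Because the two groups never share a parameter, a caption batch cannot overwrite what an instruction batch learned, and vice versa, which is the intuition behind the reduced interference.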

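Experimental Results

Research Questions

RQ1: Can LLaMA-Adapter V2 achieve open-ended visual instruction following with limited multimodal data and minimal parameter updates?
RQ2: Does the early fusion strategy improve the balance between image-text alignment and language instruction tasks?
RQ3: How does joint training with disjoint parameters affect the interference between vision-language alignment and instruction following?
RQ4: What effect does integrating external expert vision systems have on zero-shot multimodal reasoning?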

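Key Findings

LLaMA-Adapter V2 outperforms its predecessor on language instruction following and supports multi-turn dialogue.
The early fusion strategy effectively balances visual and language fine-tuning, enabling visual instruction learning without high-quality multimodal data.
Joint training with disjoint parameters makes it possible to learn from both image-text captions and language instructions without destructive interference.
Incorporating external expert systems at inference time improves image understanding without requiring costly joint vision-language pretraining.
With 14M trainable parameters, LLaMA-Adapter V2 achieves strong visual instruction capabilities while remaining highly parameter-efficient.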
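Integrating an expert system at inference time can be as simple as passing its text output to the model as extra context. The sketch below assumes a hypothetical `expert_captioner` callable and a made-up prompt template; neither is taken from the paper or its repository. It only illustrates that no weights are trained or changed in this step.

```python
from typing import Any, Callable, Optional


def build_prompt(instruction: str, expert_context: Optional[str] = None) -> str:
    """Compose a text prompt, optionally prefixed with expert-provided context."""
    parts = []
    if expert_context:
        parts.append(f"Image context: {expert_context.strip()}")
    parts.append(f"Instruction: {instruction.strip()}")
    parts.append("Response:")
    return "\n".join(parts)


def prompt_with_expert(image: Any, instruction: str,
                       expert_captioner: Optional[Callable[[Any], str]] = None) -> str:
    """Ask the (optional) expert for a caption and fold it into the prompt."""
    context = expert_captioner(image) if expert_captioner is not None else None
    return build_prompt(instruction, context)


if __name__ == "__main__":
    # Stub expert standing in for a real captioning or OCR system (assumption).
    def stub_captioner(image: Any) -> str:
        return "a street sign that reads STOP"

    print(prompt_with_expert(None, "What does the sign say?", stub_captioner))
    print(prompt_with_expert(None, "What does the sign say?"))
```

Swapping the stub for a different expert changes only the text context, so the expert can be replaced or removed without retraining anything.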

This review was generated by AI and checked by a human editor.