QUICK REVIEW

[Paper Review] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han|arXiv (Cornell University)|Apr 28, 2023

Multimodal Machine Learning Applications|Cited by 117

One-Sentence Summary

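LLaMA-Adapter V2 extends LLaMA-Adapter with bias tuning, early fusion, and joint training over disjoint parameter groups, so that open-ended visual instruction following can be learned from only a small amount of image-text and instruction data, with optional integration of expert vision systems; it adds about 14M trainable parameters (a fraction of a percent of LLaMA's weights) and delivers strong multi-modal and language-only instruction-following performance.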

ABSTRACT

How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.

Motivation and Objectives

Build instruction-following visual models without relying on large-scale multi-modal data.
Introduce parameter-efficient strategies to fuse visual information into a frozen LLM.
Propose a joint training scheme that separates image-text alignment from language instruction learning.
Enable integration with external expert vision systems to boost visual understanding.

Proposed Method

Bias tuning of linear layers by adding trainable bias and scale to all linear modules while unfreezing normalization layers.
Joint training with disjoint parameter groups: train visual projections and early zero-initialized attention for image-text caption data; train late adaptation prompts, gating, and additional LLaMA parameters for instruction data.
Early fusion of visual knowledge by injecting visual tokens at early LLM layers rather than into adaptation prompts across layers.
Incorporation of expert models (captioning/OCR/detection) during inference to enhance image understanding without extra training.
Training on 52K single-turn language instruction examples and 567K COCO image-caption pairs, plus 80K conversation examples for the chat variant, with LLaMA backbones ranging from 7B to 65B.
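Moderate parameter footprint: ~14M trainable parameters in total, about 0.2% of a 7B backbone (14M / 7B). The 0.04% figure quoted for the method refers to the newly added bias and scale terms alone (~5M parameters), not to the full 14M.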
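
The bias-tuning item above can be illustrated with a small sketch. It assumes the layer form y = s · (W·x + b), with the bias b initialized to zero and the scale s initialized to one, so the wrapped layer reproduces the frozen layer exactly before any training. The class name `BiasScaleLinear`, the helper `frozen_linear`, and the per-output-feature shape of `bias` and `scale` are illustrative assumptions, not the authors' code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def frozen_linear(weight: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product W @ x; W stands in for a frozen LLaMA weight."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class BiasScaleLinear:
    """Frozen linear layer wrapped with a trainable bias and scale (hypothetical name)."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight                # frozen: never updated during tuning
        out_features = len(weight)
        self.bias = [0.0] * out_features    # trainable, zero-initialized
        self.scale = [1.0] * out_features   # trainable, one-initialized

    def __call__(self, x: Vector) -> Vector:
        # y = s * (W x + b), applied per output feature
        base = frozen_linear(self.weight, x)
        return [s * (y + b) for s, y, b in zip(self.scale, base, self.bias)]

    def num_trainable(self) -> int:
        """Only bias and scale count; the weight matrix stays frozen."""
        return len(self.bias) + len(self.scale)


# Usage: at initialization the wrapper is an identity around the frozen layer.
layer = BiasScaleLinear([[1.0, 2.0], [3.0, 4.0]])
same_as_frozen = layer([1.0, 1.0]) == frozen_linear(layer.weight, [1.0, 1.0])
```

Because the added terms grow with the output dimension rather than with the full weight matrix, the trainable footprint stays small even when every linear module is wrapped this way.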

Experimental Results

Research Questions

RQ1: Can LLaMA-Adapter V2 achieve open-ended visual instruction-following with limited multi-modal data and minimal parameter updates?
RQ2: Does an early fusion strategy improve the balance between image-text alignment and language instruction tasks?
RQ3: How does joint training with disjoint parameters affect interference between vision-language alignment and instruction following?
RQ4: What is the impact of integrating external expert vision systems on zero-shot multimodal reasoning?

Key Findings

LLaMA-Adapter V2 surpasses its predecessor in language instruction-following and supports multi-turn dialogue.
An early fusion strategy effectively balances visual and language fine-tuning, enabling visual instruction learning without high-quality multi-modal data.
Disjoint-parameter joint training enables learning from image-text captions and language instructions without catastrophic interference.
Incorporating external expert systems at inference enhances image understanding without requiring costly joint vision-language pretraining.
With 14M trainable parameters, LLaMA-Adapter V2 achieves strong visual instruction capabilities while remaining highly parameter-efficient.
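
The disjoint-parameter finding can be made concrete with a toy sketch. It routes each batch to its own parameter group, so a caption batch never touches the instruction parameters and vice versa. The parameter names, the scalar "parameters", and the plain SGD update are simplifying assumptions for illustration only; the real model uses tensors and its own naming.

```python
from typing import Dict, List, Set

# Hypothetical parameter names, grouped by the data type that trains them.
PARAM_GROUPS: Dict[str, List[str]] = {
    "caption": ["visual_projection", "early_zero_init_attention"],
    "instruction": ["late_adaptation_prompts", "late_gating", "bias", "scale", "norm"],
}


def trainable_for(batch_type: str) -> Set[str]:
    """Names of the parameters that a batch of this type is allowed to update."""
    return set(PARAM_GROUPS[batch_type])


def sgd_step(params: Dict[str, float], grads: Dict[str, float],
             batch_type: str, lr: float = 0.1) -> Dict[str, float]:
    """Apply an SGD update only to the group that matches the batch type."""
    allowed = trainable_for(batch_type)
    return {
        name: value - lr * grads.get(name, 0.0) if name in allowed else value
        for name, value in params.items()
    }


# Usage: every parameter starts at 1.0 and receives the same gradient.
names = PARAM_GROUPS["caption"] + PARAM_GROUPS["instruction"] + ["llama_weight"]
params = {name: 1.0 for name in names}   # "llama_weight" stands in for frozen weights
grads = {name: 1.0 for name in names}

after_caption = sgd_step(params, grads, "caption")
changed = {n for n in params if after_caption[n] != params[n]}
```

After the caption step, `changed` contains only the two caption-group names; `llama_weight` belongs to no group and therefore stays frozen under both batch types.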


This review was generated by AI and checked by a human editor.