QUICK REVIEW

[Paper Review] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han | arXiv (Cornell University) | 2023. 04. 28.

Multimodal Machine Learning Applications | Citations: 117

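One-Line Summary

LLaMA-Adapter V2 enables open-ended visual instruction following from only a small amount of image-text and instruction data by combining bias tuning, early fusion, and joint training over disjoint parameter groups, and it can plug in expert vision systems when needed. It adds about 14M trainable parameters (reported as roughly 0.04% of LLaMA) and achieves strong multimodal and language instruction-following performance.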
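
To make the parameter-efficiency figure concrete, the short calculation below compares 14M trainable parameters against a few LLaMA backbone sizes. The backbone sizes are nominal round numbers assumed for illustration, not exact parameter counts. The resulting share depends on which backbone and which parameter subset is counted, so it does not have to match the percentage quoted in the summary.

```python
# Share of trainable parameters relative to a frozen backbone.
# Backbone sizes are nominal round figures (an assumption), not exact counts.

def trainable_share(trainable: float, backbone: float) -> float:
    """Return trainable parameters as a percentage of the backbone size."""
    if backbone <= 0:
        raise ValueError("backbone size must be positive")
    return 100.0 * trainable / backbone


NOMINAL_BACKBONES = {"7B": 7e9, "13B": 13e9, "65B": 65e9}

for name, size in NOMINAL_BACKBONES.items():
    print(f"LLaMA-{name}: {trainable_share(14e6, size):.3f}% trainable")
```

On these nominal sizes, 14M parameters correspond to about 0.2% of a 7B backbone and a smaller share of the larger ones.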

ABSTRACT

How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
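
The early fusion idea in the abstract can be pictured as a per-layer routing plan: visual tokens go only to the first few layers, while the language-oriented adaptation prompts sit in later layers. The sketch below is a minimal illustration under that reading. The names `LayerPlan` and `plan_layers`, the layer counts, and the rule that the two ranges never overlap are all assumptions made for this example, not values taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerPlan:
    index: int
    gets_visual_tokens: bool      # early fusion target
    gets_adaptation_prompt: bool  # late, instruction-oriented prompt


def plan_layers(num_layers: int, visual_layers: int, prompt_layers: int) -> List[LayerPlan]:
    """Assign visual tokens to the first layers and adaptation prompts to the last ones."""
    if min(num_layers, visual_layers, prompt_layers) < 0:
        raise ValueError("layer counts must be non-negative")
    if visual_layers + prompt_layers > num_layers:
        # Assumption of this sketch: the two ranges are kept separate.
        raise ValueError("visual and prompt layers must not overlap")
    first_prompt_layer = num_layers - prompt_layers
    return [
        LayerPlan(i, i < visual_layers, i >= first_prompt_layer)
        for i in range(num_layers)
    ]


# Hypothetical 8-layer model: 2 early visual layers, 4 late prompt layers.
for layer in plan_layers(num_layers=8, visual_layers=2, prompt_layers=4):
    print(layer)
```

Keeping the two ranges separate mirrors the stated goal of reducing interference between image-text alignment and instruction following.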

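Motivation and Objectives

Motivate building an instruction-following visual model without large-scale multimodal data.
Introduce a parameter-efficient strategy for fusing visual information into a frozen LLM.
Propose a joint training scheme that separates image-text alignment from language instruction learning.
Enable integration with external expert vision systems to improve visual understanding.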

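Proposed Method

Bias tuning of linear layers: unfreeze the normalization layers and add a learnable bias and scale to every linear module.
Joint training over disjoint parameter groups: train the visual projection and the early zero-initialized attention on image-text caption data, and train the late adaptation prompts, gating, and the additional LLaMA parameters on instruction data.
Early fusion: inject visual tokens only into the early LLM layers rather than into every layer, so visual knowledge is fused early.
Expert integration: add expert models such as captioning, OCR, or detection systems at inference to improve image understanding without additional training.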
Training data: 52K instruction-following examples and 567K image-text captions (COCO), plus 80K conversation examples, on LLaMA backbones from 7B to 65B.
Parameter footprint: stays small, at about 14M trainable parameters, reported as roughly 0.04% of the full model.
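
The first two items above can be sketched in a few lines of plain Python. The example shows a frozen linear map wrapped with a learnable bias and scale, and a lookup that decides which parameter group is trainable for each kind of batch. The form `scale * (W x + bias)` with zero bias and unit scale at initialization is an assumption about how the bias and scale are applied. The group names in `PARAM_GROUPS` are hypothetical labels for the components listed above, not identifiers from the released code.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple


def tuned_linear(x: Sequence[float], weight: Sequence[Sequence[float]],
                 bias: Sequence[float], scale: Sequence[float]) -> List[float]:
    """Compute scale * (W x + bias) row by row; W is the frozen weight matrix."""
    return [
        s * (sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b, s in zip(weight, bias, scale)
    ]


def identity_init(out_features: int) -> Tuple[List[float], List[float]]:
    """Bias starts at 0 and scale at 1, so the frozen layer is unchanged at step 0."""
    return [0.0] * out_features, [1.0] * out_features


# Hypothetical parameter-group labels; each data type updates only its own group.
PARAM_GROUPS: Dict[str, FrozenSet[str]] = {
    "caption": frozenset({"visual_projection", "early_zero_init_attention"}),
    "instruction": frozenset({
        "late_adaptation_prompts", "late_gating",
        "norm", "linear_bias", "linear_scale",
    }),
}
# The groups must not share parameters, otherwise the two tasks would interfere.
assert PARAM_GROUPS["caption"].isdisjoint(PARAM_GROUPS["instruction"])


def trainable_for(batch_kind: str) -> FrozenSet[str]:
    """Return the parameter names that receive gradient updates for this batch."""
    return PARAM_GROUPS[batch_kind]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen weights
bias, scale = identity_init(2)
print(tuned_linear([1.0, 1.0], W, bias, scale))

# Alternate caption and instruction batches, switching the trainable group.
for kind in ["caption", "instruction", "caption"]:
    print(kind, sorted(trainable_for(kind)))
```

In a real training loop, the same lookup would typically decide which parameters have gradients enabled before each batch.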

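Experimental Results

Research Questions

RQ1: Can LLaMA-Adapter V2 achieve open-ended visual instruction following with limited multimodal data and minimal parameter updates?
RQ2: Does the early fusion strategy improve the balance between image-text alignment and language instruction tasks?
RQ3: How does joint training over disjoint parameters affect interference between vision-language alignment and instruction following?
RQ4: What effect does integrating external expert vision systems have on zero-shot multimodal reasoning?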

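Key Findings

LLaMA-Adapter V2 outperforms its predecessor on language instruction following and supports multi-turn dialogue.
The early fusion strategy effectively balances visual and language fine-tuning, enabling visual instruction learning without high-quality multimodal data.
Joint training over disjoint parameters allows learning from image-text captions and language instructions without catastrophic interference.
Adding external expert systems at inference improves image understanding without requiring costly joint vision-language pretraining.
With 14M trainable parameters, LLaMA-Adapter V2 achieves strong visual instruction ability while remaining parameter-efficient.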
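
The inference-time expert integration can be illustrated as prompt assembly: each expert turns the image into text, and that text is placed ahead of the user instruction. This is a minimal sketch, not the released pipeline. The `Expert` type, the `build_prompt` function, the prompt template, and the stub experts are all hypothetical, and a real system would call actual captioning or OCR models.

```python
from typing import Callable, Dict

# An expert maps raw image bytes to a text description (hypothetical interface).
Expert = Callable[[bytes], str]


def build_prompt(instruction: str, image: bytes, experts: Dict[str, Expert]) -> str:
    """Prepend non-empty expert outputs to the instruction as textual context."""
    context_lines = []
    for name, expert in experts.items():
        text = expert(image).strip()
        if text:  # skip experts that found nothing
            context_lines.append(f"{name}: {text}")
    if context_lines:
        context = "\n".join(context_lines)
        return f"Image context:\n{context}\n\nInstruction: {instruction}\nResponse:"
    return f"Instruction: {instruction}\nResponse:"


# Stub experts standing in for real captioning and OCR systems.
stub_experts: Dict[str, Expert] = {
    "caption": lambda image: "a dog on a beach",
    "ocr": lambda image: "",  # no text detected
}

print(build_prompt("What is shown in the image?", b"", stub_experts))
```

Because the experts only contribute text to the prompt, they can be swapped or removed without retraining the adapter.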

This review was generated by AI and reviewed by a human editor.