QUICK REVIEW

[Paper Review] Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Gen Luo, Yiyi Zhou | arXiv (Cornell University) | May 24, 2023

Multimodal Machine Learning Applications | Citations: 45

ONE-LINE SUMMARY

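This paper proposes Mixture-of-Modality Adaptation (MMA), which uses lightweight adapters to adapt LLMs to vision-language (VL) tasks efficiently and makes end-to-end training possible. Applying MMA to LLaMA yields LaVIN, a vision-language instructed model that achieves competitive performance at a substantially lower training cost.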

ABSTRACT

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive parameters, but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions without compromising their ability of natural language understanding. To validate MMA, we apply it to a recent LLM called LLaMA and term this formed large vision-language instructed model as LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and the superior training efficiency of LaVIN than existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely cheap, e.g., only 1.4 training hours with 3.8M trainable parameters, greatly confirming the effectiveness of MMA. Our project is released at https://luogen1996.github.io/lavin.

RESEARCH MOTIVATION AND GOALS

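Motivate affordable VL adaptation of large language models (LLMs) without large-scale VL pre-training.
Introduce MMA, which connects the image encoder and the LLM through lightweight adapters and a dynamic routing mechanism.
Demonstrate end-to-end training with a small parameter footprint, and validate it on multimodal science question answering and multimodal dialogue tasks.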

PROPOSED METHOD

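Introduces the Mixture-of-Modality Adapter (MM-Adapter), which uses a modality token and a softmax-based router to route dynamically between single-modal and multimodal adaptation.
Defines Mixture-of-Modality Training (MMT), which freezes the backbone LLM and the image encoder and fine-tunes only the adapters with an end-to-end objective.
Applies MMA to LLaMA to build LaVIN, using CLIP-ViT as the image encoder and six [cls] tokens from the ViT as visual features.
Uses a lightweight visual neck, keeping the trainable footprint small (e.g., 3–5M trainable parameters), and trains on a mixture of text-only and text-image instructions.
Enables automatic switching between text-only and image-text instructions and jointly optimizes the multimodal LLM through end-to-end training.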
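
The routing idea described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes a bottleneck adapter with a ReLU, a router that is a single linear map followed by a softmax over two paths, random initialization, and toy dimensions. The names `BottleneckAdapter`, `MMAdapterSketch`, `routing_weights`, and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the init scheme is an assumption, not from the paper."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Down-projection, ReLU, up-projection (assumed adapter shape)."""

    def __init__(self, dim, hidden, rng):
        self.down = make_matrix(hidden, dim, rng)
        self.up = make_matrix(dim, hidden, rng)

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, h)


class MMAdapterSketch:
    """Two adapter paths mixed by softmax weights computed from a modality token."""

    def __init__(self, dim, hidden, seed=0, temperature=1.0):
        rng = random.Random(seed)
        self.text_path = BottleneckAdapter(dim, hidden, rng)
        self.multimodal_path = BottleneckAdapter(dim, hidden, rng)
        self.router = make_matrix(2, dim, rng)  # one score per path
        self.temperature = temperature

    def routing_weights(self, modality_token):
        return softmax(matvec(self.router, modality_token), self.temperature)

    def __call__(self, x, modality_token):
        w_text, w_mm = self.routing_weights(modality_token)
        a = self.text_path(x)
        b = self.multimodal_path(x)
        # Residual connection plus the routed mixture of the two paths.
        return [xi + w_text * ai + w_mm * bi for xi, ai, bi in zip(x, a, b)]


adapter = MMAdapterSketch(dim=8, hidden=2)
hidden_state = [0.1] * 8
toy_modality_token = [1.0] * 8  # hypothetical embedding of the input's modality
print(adapter.routing_weights(toy_modality_token))
print(adapter(hidden_state, toy_modality_token))
```

In this sketch only the two adapter paths and the router hold parameters, which mirrors the idea of training the adapters while the backbone stays frozen. The modality token alone decides the mixing weights, so the same layer can serve text-only and image-text inputs.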

EXPERIMENTAL RESULTS

Research Questions

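RQ1: Can MMA achieve competitive VL instruction-tuning performance with significantly reduced training cost and parameters?
RQ2: Does LaVIN retain its NLP capability while gaining VL understanding without VL pre-training?
RQ3: How does MMA facilitate automatic switching between text-only and image-text instructions during inference?
RQ4: How do adapter size, image encoder strength, and LLM scale affect performance on ScienceQA and multimodal dialogue?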

Key Results

Method	#T-Params	NAT	SOC	LAN	TXT	IMG	NO	G1-6	G7-12	Avg.
LaVIN-7B	3.8M	89.25	94.94	85.24	88.51	87.46	88.08	90.16	88.07	89.41
LaVIN-13B	5.4M	90.32	94.38	87.73	89.44	87.65	90.31	91.19	89.26	90.50
Column key: #T-Params = trainable parameters; NAT, SOC, LAN = natural, social, and language science; TXT, IMG, NO = text, image, and no context; G1-6, G7-12 = grade levels; Avg. = average accuracy (%).

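LaVIN with MMA achieves performance competitive with SOTA multimodal LLMs while sharply reducing training time and storage (e.g., 1.4 hours with 3.8M trainable parameters on ScienceQA).
LaVIN-13B reaches about 90.83 accuracy on the ScienceQA test set with a 5.4M-parameter budget and outperforms several parameter-efficient baselines. (The table above lists an average of 90.50 for LaVIN-13B; the two figures are not reconciled here.)
Among the ablated components, MMT yields the largest gain: average accuracy improves by up to +4.69 once a stronger image encoder and joint optimization are introduced.
Joint optimization of the image encoder and the LLM, together with a stronger image encoder (ViT-L/14), gives clear accuracy gains and supports the benefit of end-to-end VL adaptation.
On COCO captioning, LaVIN achieves a competitive CIDEr score with far less pre-training data and far fewer updated parameters, at a much lower training cost than BLIP-2 and LLaVA.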
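
To put the trainable-parameter counts in perspective, the arithmetic below relates them to the backbone size. It is a rough sketch: the 7B and 13B figures are nominal sizes read off the model names, which is an assumption, and the exact LLaMA parameter counts differ slightly. The names `TRAINABLE_PARAMS`, `NOMINAL_BACKBONE_PARAMS`, and `trainable_percent` are hypothetical.

```python
# Trainable-parameter counts as listed in the results table.
TRAINABLE_PARAMS = {"LaVIN-7B": 3.8e6, "LaVIN-13B": 5.4e6}

# ASSUMPTION: nominal backbone sizes taken from the model names,
# not exact LLaMA parameter counts.
NOMINAL_BACKBONE_PARAMS = {"LaVIN-7B": 7e9, "LaVIN-13B": 13e9}


def trainable_percent(model: str) -> float:
    """Trainable parameters as a percentage of the nominal backbone size."""
    return 100.0 * TRAINABLE_PARAMS[model] / NOMINAL_BACKBONE_PARAMS[model]


for name in TRAINABLE_PARAMS:
    print(f"{name}: {trainable_percent(name):.3f}% of the nominal backbone size")
# LaVIN-7B: 0.054% of the nominal backbone size
# LaVIN-13B: 0.042% of the nominal backbone size
```

Under these nominal sizes, both models update well under 0.1% of the backbone's parameter count.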

This review was generated by AI and reviewed by a human editor.