QUICK REVIEW

[Paper Review] Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Juncheng Li, Kaihang Pan | arXiv (Cornell University) | 2023-08-08

Multimodal Machine Learning Applications | Citations: 11

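One-Line Summary

Proposes VPG-C, a lightweight Visual Prompt Generator Complete module that, trained with a synthetic discriminative training strategy, enables multimodal LLMs to follow zero-shot demonstrative instructions, and introduces the DEMON benchmark for evaluation.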

ABSTRACT

Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where the VPG-generated tokens of images are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning based training objective inherently biases the VPG to concentrate solely on the primary visual contents sufficient for caption generation, often neglecting other visual details. This shortcoming results in MLLMs' underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task. To address this issue, we introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C), which can infer and complete the missing details essential for comprehending demonstrative instructions. Further, we propose a synthetic discriminative training strategy to fine-tune VPG-C, eliminating the need for supervised demonstrative instructions. As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and OwlEval benchmarks also demonstrate the superiority of VPG-C. Our benchmark, code, and pre-trained models are available at https://github.com/DCDmllm/Cheetah.

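Motivation and Objectives

Argues that models need to understand interleaved multimodal demonstrations, not just the primary visual content.
Introduces VPG-C, a lightweight, generic module that infers and completes the visual details missing for demonstrative instructions.
Develops a synthetic discriminative training strategy that requires no supervised demonstrative instruction data.
Builds and releases DEMON, a comprehensive benchmark for evaluating demonstrative instruction understanding in MLLMs.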

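Proposed Method

Uses a frozen LLM (Vicuna-7B), a vision encoder (EVA-CLIP), and a Q-Former as the base VPG.
VPG-C derives instruction-specific guidance from intermediate LLM outputs and generates residual visual prompts.
The residual prompts are merged back through a skip connection to enrich the multimodal representation.
Only the VPG-C parameters (0.09% of the model) are trained, using synthetic discriminative training.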
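Synthetic training edits the image regions that the cross-attention maps ignore, forms synthetic image pairs from the original and edited images, and trains the model to describe the differences between them.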
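
The data flow described above (guidance from the LLM, a residual visual prompt, and a skip connection back into the original prompts) can be illustrated with a minimal sketch. This is not the authors' implementation: the single-head attention, the projection matrices `w_query`, `w_key`, `w_value`, the function names, and the toy 2-dimensional vectors are all assumptions made for illustration, and the review does not specify the actual layers or shapes.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_visual_prompt(guidance, patch_features, w_query, w_key, w_value):
    """Attend over image patch features with an instruction-derived query.

    guidance       : vector standing in for an intermediate LLM hidden state
    patch_features : one feature vector per image patch
    w_query, w_key, w_value : hypothetical projection matrices
    Returns the residual vector and the attention weights over patches.
    """
    query = matvec(w_query, guidance)
    keys = [matvec(w_key, patch) for patch in patch_features]
    values = [matvec(w_value, patch) for patch in patch_features]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    residual = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return residual, weights


def add_skip_connection(visual_prompts, residual):
    """Add the residual to every original visual prompt token."""
    return [[p + r for p, r in zip(token, residual)] for token in visual_prompts]


# Toy example: 2-dimensional features, identity projections, two image patches.
identity = [[1.0, 0.0], [0.0, 1.0]]
guidance = [1.0, 0.0]               # stands in for the LLM hidden state
patches = [[1.0, 0.0], [0.0, 1.0]]  # stands in for vision-encoder features
prompts = [[0.5, 0.5], [0.2, 0.8]]  # stands in for the base VPG's visual prompts

residual, weights = residual_visual_prompt(
    guidance, patches, identity, identity, identity
)
completed = add_skip_connection(prompts, residual)
print(weights)    # the patch aligned with the guidance gets the larger weight
print(completed)  # original prompts shifted by the residual
```

In this toy setup, a zero value projection yields a zero residual, so the completed prompts equal the originals. That is the property a skip connection provides: the module can only add detail on top of what the base VPG already produced.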

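Experimental Results

Research Questions

RQ1: Can VPG-C enable a model to understand interleaved multimodal instructions presented zero-shot, without labeled demonstration data?
RQ2: Compared with conventional VPGs, does synthetic discriminative training improve the handling of missing visual details?
RQ3: How does VPG-C perform on existing multimodal benchmarks (MME, OwlEval) and on the newly introduced DEMON benchmark?
RQ4: Where in the LLM/VPG pipeline should the guidance and residual details be injected for the best performance?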

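Key Results

VPG-C consistently outperforms existing multimodal LLMs across the DEMON task categories.
Synthetic training data combined with the VPG-C module yields a substantial gain over training on image-caption data alone.
These improvements come from tuning only a lightweight module of 6.3M parameters.
In zero-shot evaluation, VPG-C also performs strongly on additional benchmarks such as MME and OwlEval.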
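
As a rough sanity check, the 6.3M trainable parameters can be related to the 0.09% share quoted in the method section. The review does not state which total the percentage is measured against, so the 7-billion-parameter denominator below is an assumption suggested only by the name Vicuna-7B. Counting the vision encoder and Q-Former as well would make the share slightly smaller.

```python
def trainable_share_percent(trainable_params, total_params):
    """Return trainable parameters as a percentage of the total."""
    return 100.0 * trainable_params / total_params


vpg_c_params = 6.3e6         # trainable VPG-C parameters, as stated in the review
assumed_total_params = 7e9   # assumption: approximate LLM size implied by "Vicuna-7B"

share = trainable_share_percent(vpg_c_params, assumed_total_params)
print(f"{share:.2f}%")  # prints 0.09%
```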

This review was generated by AI and reviewed by a human editor.