QUICK REVIEW

[Paper Review] Large Language Models are Visual Reasoning Coordinators

Liangyu Chen, Bo Li|arXiv (Cornell University)|2023. 10. 23.

Multimodal Machine Learning Applications | Citations: 14

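One-Line Summary

Cola uses a coordinator LLM to fuse multiple vision-language models for visual reasoning: the instruction-tuned variant (Cola-FT) achieves state-of-the-art results on several VQA and visual reasoning benchmarks, and the in-context learning variant (Cola-Zero) is competitive in zero- and few-shot settings.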

ABSTRACT

Visual reasoning requires multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensemble still struggle to aggregate these models with the desired higher-order communications. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a large language model (LLM) can efficiently coordinate multiple VLMs by facilitating natural language communication that leverages their distinct and complementary capabilities. Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities.

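Motivation and Objectives

Elicit visual reasoning from multiple VLMs and exploit their complementary strengths.
Propose Cola, an LLM-based coordinator paradigm that integrates VLM captions and plausible answers.
Show that Cola-FT (an instruction-tuned LLM) achieves SOTA on several datasets.
Show that Cola-Zero (in-context learning) delivers competitive zero-/few-shot performance without finetuning.
Analyze how the coordinator learns to make use of each VLM's capabilities.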

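Proposed Method

Use OFA and BLIP as the vision-language models that generate image captions and plausible answers.
Prompt each VLM with the image-question context to obtain a caption c_i(v) and a plausible answer â_i(v,q).
Build a joint Prompt(v,q) that contains the VLM captions and plausible answers, each tagged with its VLM label, and feed it to the coordinator LLM.
Finetune the coordinator LLM (Cola-FT) with next-token prediction and teacher forcing while keeping the VLMs frozen.
Alternatively, use in-context learning with k-shot demonstrations in the prompt and no parameter updates, which yields Cola-Zero.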
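
The prompt-construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `VLMOutput` container, the function names, and the exact template wording ("caption:", "answer:", "Question:", "Answer:") are all assumptions made for this example. It shows how labelled VLM captions and plausible answers could be joined into one coordinator prompt, and how k solved demonstrations could be prepended for the in-context (Cola-Zero style) setting.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VLMOutput:
    """Hypothetical container for what one frozen VLM returns."""
    name: str     # VLM label, e.g. "OFA" or "BLIP"
    caption: str  # caption c_i(v)
    answer: str   # plausible answer a_i(v, q)


def build_prompt(question: str, outputs: List[VLMOutput],
                 target: Optional[str] = None) -> str:
    """Join labelled VLM outputs and the question into one prompt.

    The template wording is an assumption chosen for illustration.
    """
    lines = []
    for out in outputs:
        lines.append(f"{out.name} caption: {out.caption}")
        lines.append(f"{out.name} answer: {out.answer}")
    lines.append(f"Question: {question}")
    # The answer slot stays open for the query and is filled for demonstrations.
    lines.append("Answer:" if target is None else f"Answer: {target}")
    return "\n".join(lines)


Demo = Tuple[str, List[VLMOutput], str]   # (question, VLM outputs, gold answer)
Query = Tuple[str, List[VLMOutput]]       # (question, VLM outputs)


def build_few_shot_prompt(demos: List[Demo], query: Query) -> str:
    """Prepend k solved demonstrations to the query; no weights are updated."""
    blocks = [build_prompt(q, outs, target=ans) for q, outs, ans in demos]
    blocks.append(build_prompt(query[0], query[1]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = (
        "What color is the bus?",
        [VLMOutput("OFA", "a red bus on a street", "red"),
         VLMOutput("BLIP", "a bus parked near a curb", "blue")],
        "red",
    )
    query = (
        "What is the man holding?",
        [VLMOutput("OFA", "a man standing on a beach", "surfboard"),
         VLMOutput("BLIP", "a man with a kite", "kite")],
    )
    print(build_few_shot_prompt([demo], query))
```

With an empty demonstration list the same function produces the plain joint prompt, which is the form a finetuned coordinator would receive; in that setting the gold answer would serve as the next-token prediction target rather than appearing in the prompt.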

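Experimental Results

Research Questions

RQ1: Can a coordinator LLM effectively fuse the outputs of multiple VLMs to improve visual reasoning?
RQ2: How do instruction tuning (Cola-FT) and in-context learning (Cola-Zero) compare in performance, efficiency, and scalability?
RQ3: Which aspect of the VLM outputs (captions vs. plausible answers) contributes most to the coordinator's decisions?
RQ4: How well does Cola transfer across different visual reasoning tasks and datasets?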

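Key Findings

Cola-FT achieves state-of-the-art performance on several datasets, including A-OKVQA, OK-VQA, e-SNLI-VE, and VSR.
Cola-Zero shows competitive zero-shot and few-shot performance without finetuning by relying on in-context learning.
Using both VLM captions and plausible answers provides a strong signal; the plausible answers are especially influential in guiding the coordinator.
The coordinator LLM learns to distinguish and exploit each VLM's distinct capabilities, improving reasoning over single-VLM baselines and simple ensembles.
Scaling Cola to more VLMs yields substantial gains; Cola-FT benefits from instruction tuning, whereas Cola-Zero shows abilities that emerge at larger LLM scales.
Qualitative analysis and attention visualizations show how the coordinator interprets the prompt and selects among the VLM outputs.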

This review was generated by AI and checked by a human editor.