QUICK REVIEW

[Paper Review] MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang|arXiv (Cornell University)|2023. 08. 04.

Topic Modeling | Citations: 59

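One-Line Summary

MM-Vet is a benchmark that uses an LLM-based evaluator to assess large multimodal models on integrated vision-language tasks; it covers 16 capability integrations derived from six core VL capabilities.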

ABSTRACT

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.

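Motivation and Objectives

Define six core vision-language capabilities: recognition, OCR, knowledge, language generation, spatial awareness, and math.
Construct 16 integrated tasks that require combinations of these capabilities, mirroring real-world scenarios.
Introduce an LLM-based evaluator to score open-ended model outputs across different question types.
Benchmark representative end-to-end LMMs and LLM tool-use systems to expose the strengths and weaknesses of each paradigm.
Provide insight into how architecture, data, and tuning affect integrated multimodal capabilities.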

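Proposed Method

Define the six core VL capabilities and the 16 integrations that make up the MM-Vet tasks.
Collect 200 images and 218 questions with open-ended answers, each annotated with a ground-truth answer.
Use a GPT-4-based few-shot evaluator to assign each sample a correctness score between 0 and 1.
Compute the total score (S) and the per-capability scores (S_c) by aggregating the per-sample scores.
Compare end-to-end tuned LMMs and LLM tool-use systems on both the Bard set and the non-Bard set.
Analyze how the vision encoder, LLM size, and tuning data affect performance.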
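The aggregation step above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes S is the mean of the per-sample evaluator scores and S_c is the mean over the samples that require capability c, both reported as percentages. The `ScoredSample` structure, the capability abbreviations, and the scores themselves are hypothetical and invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredSample:
    """One benchmark question with its evaluator score (hypothetical structure)."""
    capabilities: frozenset  # capabilities the question requires, e.g. {"ocr", "math"}
    score: float             # evaluator output in [0, 1]


def total_score(samples):
    """S: mean per-sample score, as a percentage (assumed scaling)."""
    return 100.0 * sum(s.score for s in samples) / len(samples)


def capability_scores(samples):
    """S_c: mean score over the samples that require capability c, as a percentage."""
    buckets = defaultdict(list)
    for s in samples:
        # A sample counts toward every capability it requires.
        for cap in s.capabilities:
            buckets[cap].append(s.score)
    return {cap: 100.0 * sum(v) / len(v) for cap, v in buckets.items()}


# Made-up scores for illustration only; these are not results from the paper.
samples = [
    ScoredSample(frozenset({"rec"}), 1.0),
    ScoredSample(frozenset({"ocr", "math"}), 0.5),
    ScoredSample(frozenset({"ocr", "spat", "math"}), 0.0),
    ScoredSample(frozenset({"rec", "know", "gen"}), 0.5),
]

print(total_score(samples))
print(capability_scores(samples))
```

Because a question that integrates several capabilities contributes to each of their buckets, the per-capability scores are not a partition of the total score. Grouping on the whole `capabilities` set instead of on its members would give per-integration scores in the same way.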

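Experimental Results

Research Questions

RQ1: How do integrated VL capabilities across diverse tasks relate to overall LMM performance?
RQ2: How do the system paradigms (end-to-end versus LLM tool-use) differ in their strengths across capabilities and integrations?
RQ3: How do the vision backbone, language model, and tuning data affect MM-Vet results?
RQ4: Can an LLM-based evaluator provide a unified, scalable metric across different answer styles and question types?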

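Key Findings

LLaVA-13B (LLaMA-2) achieves the highest recognition score among the compared models, highlighting the benefit of a larger LLM and vision backbone.
MM-ReAct-GPT-4 excels at OCR and math by drawing on external tools, suggesting the value of tool use for structured tasks.
LLaMA-Adapter v2-7B performs strongly across several capabilities, which is attributed to its broad tuning data.
MM-ReAct-GPT-4 leads overall on capability integrations, particularly those that combine OCR, spatial awareness, and math.
On the Bard set, Bard achieved the highest total score on the samples whose images it could process, while MM-ReAct-GPT-4 also performed strongly in several categories.
The LLM-based evaluator enables unified scoring across open-ended outputs and diverse answer styles.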


This review was generated by AI and checked by a human editor.