QUICK REVIEW

[Paper Review] MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang|arXiv (Cornell University)|Aug 4, 2023

Topic Modeling | Citations: 59

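One-sentence summary

MM-Vet is a benchmark that evaluates large multimodal models on integrated vision-language tasks using an LLM-based evaluator; it comprises 16 tasks built from 6 core VL capabilities.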

ABSTRACT

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.

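Motivation and objectives

Define 6 core VL capabilities (recognition, OCR, knowledge, language generation, spatial awareness, and math).
Construct 16 integration tasks that require combinations of these capabilities, mimicking real-world scenarios.
Introduce an LLM-based evaluator that scores open-ended model outputs across diverse question types.
Benchmark representative end-to-end LMMs and LLM tool-using systems to reveal the strengths and weaknesses of each paradigm.
Provide insights into how architecture, data, and tuning affect integrated multimodal capabilities.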

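Proposed method

Define 6 core VL capabilities and 16 integrations to form the MM-Vet tasks.
Assemble 200 images and 218 questions with ground-truth annotations, including questions that call for open-ended outputs.
Use a GPT-4-based few-shot evaluator to assign each sample a correctness score between 0 and 1.
Compute the overall score and per-capability scores using the aggregations described in the paper (e.g., S, S_c).
Compare end-to-end tuned LMMs and LLM tool-using systems on both the Bard set and the non-Bard set.
Analyze how the vision encoder, LLM size, and tuning data affect performance.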
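
The aggregation step can be illustrated with a minimal Python sketch. It assumes that S is the mean of the per-sample evaluator scores expressed as a percentage, and that S_c is the same mean restricted to samples requiring capability c; this review does not spell out the formulas, so treat both as assumptions. The sample records and the capability abbreviations (`rec`, `ocr`, `know`, `gen`, `spat`, `math`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-sample records: an evaluator score in [0, 1] and the
# set of core capabilities the question requires (names are illustrative).
samples = [
    {"score": 1.0, "caps": {"rec", "ocr"}},
    {"score": 0.5, "caps": {"ocr", "math"}},
    {"score": 0.0, "caps": {"rec", "know", "gen"}},
    {"score": 1.0, "caps": {"ocr", "spat", "math"}},
]


def overall_score(samples):
    """Assumed S: mean of all per-sample scores, as a percentage."""
    return 100.0 * sum(s["score"] for s in samples) / len(samples)


def capability_scores(samples):
    """Assumed S_c: mean score over the samples that require capability c."""
    buckets = defaultdict(list)
    for s in samples:
        for cap in s["caps"]:
            buckets[cap].append(s["score"])
    return {cap: 100.0 * sum(v) / len(v) for cap, v in buckets.items()}


def integration_scores(samples):
    """Mean score per exact capability combination (one key per integration)."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[frozenset(s["caps"])].append(s["score"])
    return {combo: 100.0 * sum(v) / len(v) for combo, v in buckets.items()}


S = overall_score(samples)
S_c = capability_scores(samples)
S_int = integration_scores(samples)
```

Because a sample contributes to every capability it requires, the per-capability scores overlap and do not average back to S; the per-integration grouping, by contrast, partitions the samples.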

Experimental results

Research questions

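RQ1: How do integrated VL capabilities relate to overall LMM performance across tasks?
RQ2: How do system paradigms, such as end-to-end and LLM tool-based systems, differ in their strengths across capabilities and integrations?
RQ3: How do the vision backbone, language model, and tuning data affect MM-Vet results?
RQ4: Can an LLM-based evaluator provide a unified, scalable metric across diverse answer styles and question types?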

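Key findings

LLaVA-13B (LLaMA-2) achieves the top recognition score among the compared models, highlighting the benefit of larger LLMs and vision backbones.
MM-ReAct-GPT-4 excels at OCR and math by leveraging external tools, showing the value of tool use on structured tasks.
LLaMA-Adapter v2-7B performs strongly on several capabilities, owing to its broad tuning data.
MM-ReAct-GPT-4 generally leads on multiple capability integrations, especially those combining OCR, spatial awareness, and math.
Results on the Bard set show that Bard achieves the highest total score on the subset of samples it can process, while MM-ReAct-GPT-4 also performs well in several categories.
The LLM-based evaluator enables unified scoring across open-ended outputs and diverse answer styles.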
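
To make the idea of an LLM-based evaluator concrete, here is a rough Python sketch of how a few-shot prompt might be assembled and how a numeric score could be read back from the model's reply. Everything in it is an assumption for illustration: the prompt wording, the few-shot examples, and the `llm` callable are hypothetical and are not the prompt or API used in the paper.

```python
import re
from typing import Callable

# Hypothetical few-shot examples (question, ground truth, prediction, score).
# These are NOT the examples used by the MM-Vet authors.
FEW_SHOT = [
    ("What is 2 + 3?", "5", "The answer is 5.", 1.0),
    ("What is 2 + 3?", "5", "It is 6.", 0.0),
]


def build_prompt(question: str, ground_truth: str, prediction: str) -> str:
    """Assemble an illustrative few-shot prompt that ends where the score goes."""
    lines = ["Rate the correctness of the prediction from 0.0 to 1.0.", ""]
    for q, gt, pred, score in FEW_SHOT:
        lines.append(
            f"Question: {q} | Ground truth: {gt} | Prediction: {pred} | Score: {score}"
        )
    lines.append(
        f"Question: {question} | Ground truth: {ground_truth} "
        f"| Prediction: {prediction} | Score:"
    )
    return "\n".join(lines)


def parse_score(reply: str) -> float:
    """Take the first number in the reply and cap it at 1.0."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score found in reply: {reply!r}")
    return min(1.0, float(match.group()))


def evaluate(question: str, ground_truth: str, prediction: str,
             llm: Callable[[str], str]) -> float:
    """Score one sample; `llm` is any function mapping a prompt to a reply."""
    return parse_score(llm(build_prompt(question, ground_truth, prediction)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return " 0.5"


score = evaluate("What colour is the sky?", "blue", "Light blue.", fake_llm)
```

Passing the model as a plain callable keeps the scoring logic independent of any particular API client, so the same code can be exercised with a stub, as here, or with a real model.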


This review was generated by AI and checked by a human editor.