QUICK REVIEW

[Paper Review] MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan|arXiv (Cornell University)|Jul 12, 2023

Multimodal Machine Learning Applications | Cited by 32

One-line summary

MMBench provides vision-language models with a comprehensive, objective multi-ability benchmark that evaluates 20 fine-grained abilities across 14 models, using CircularEval and ChatGPT-based choice extraction.

ABSTRACT

Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of the following key features: 1. MMBench is meticulously curated with well-designed quality control schemes, surpassing existing similar benchmarks in terms of the number and variety of evaluation questions and abilities; 2. MMBench introduces a rigorous CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices, which helps to yield accurate evaluation results for models with limited instruction-following capabilities; 3. MMBench incorporates multiple-choice questions in both English and Chinese versions, enabling an apples-to-apples comparison of VLMs' performance under a bilingual context. To summarize, MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of vision-language models. We hope MMBench will assist the research community in better evaluating their models and facilitate future progress in this area. The evaluation code of MMBench has been integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

Motivation and objectives

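Define a hierarchical, fine-grained taxonomy of vision-language model abilities (perception and reasoning).
Build a large, diverse dataset of roughly 3,000 multiple-choice questions covering 20 leaf abilities.
Introduce CircularEval to make evaluation more robust while keeping cost down.
Use ChatGPT as a universal choice extractor for free-form VLM outputs.
Benchmark 14 prominent vision-language models to analyze ability gaps and offer design guidance.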

Proposed method

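Build a three-level ability taxonomy (L-1 to L-3) with 20 leaf abilities.
Collect 2,974 image-based Q/A items from diverse sources, each with 2–4 choices.
Score predictions with the CircularEval protocol, which rotates the choices across passes, once free-form predictions have been converted to single-choice labels.
Use ChatGPT to map model outputs to the matching choice label, falling back to human or random labeling when parsing fails.
Provide dev and test splits: dev ground truth is public, test answers are withheld, and test evaluation goes through a server.
Evaluate 14 LVLMs (mostly under 10B parameters) and analyze per-ability performance, comparing against larger variants where available.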
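The choice rotation in CircularEval can be sketched as follows. This is a minimal illustration, not the VLMEvalKit implementation: it assumes, as the MMBench paper describes, that a question with N choices is asked N times with the choices circularly shifted and counts as solved only if every pass is answered correctly. The names `rotations`, `circular_eval`, `vanilla_eval`, `always_a`, and `oracle` are hypothetical, and `predict` stands in for any model wrapper that returns a choice label.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

LABELS = "ABCD"  # MMBench questions have 2-4 choices

# A predictor takes (question, options) and returns a label such as "A".
Predictor = Callable[[str, Sequence[str]], str]


def rotations(options: Sequence[str], answer_idx: int) -> Iterator[Tuple[List[str], int]]:
    """Yield every circular shift of the options with the shifted answer index."""
    n = len(options)
    for shift in range(n):
        rotated = [options[(i + shift) % n] for i in range(n)]
        # The correct option moves from answer_idx to (answer_idx - shift) mod n.
        yield rotated, (answer_idx - shift) % n


def vanilla_eval(predict: Predictor, question: str,
                 options: Sequence[str], answer_idx: int) -> bool:
    """Single-pass evaluation: ask once with the original option order."""
    return predict(question, options) == LABELS[answer_idx]


def circular_eval(predict: Predictor, question: str,
                  options: Sequence[str], answer_idx: int) -> bool:
    """Assumed pass criterion: the model must be right on every rotation."""
    for rotated, gold in rotations(options, answer_idx):
        if predict(question, rotated) != LABELS[gold]:
            return False
    return True


def always_a(question: str, options: Sequence[str]) -> str:
    """Toy position-biased model that always answers 'A'."""
    return "A"


def oracle(question: str, options: Sequence[str]) -> str:
    """Toy model that always finds the option 'cat', wherever it sits."""
    return LABELS[list(options).index("cat")]
```

With options `["cat", "dog", "bird"]` and the correct answer at index 0, `always_a` passes `vanilla_eval` but fails `circular_eval`, while `oracle` passes both. This shows how rotation filters out answers that are right only because of position bias.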

Experimental results

Research questions

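RQ1: Can a fine-grained hierarchical benchmark reliably quantify the diverse perception and reasoning abilities of LVLMs?
RQ2: Does CircularEval make single-choice assessment more robust than conventional single-pass evaluation?
RQ3: Is ChatGPT a reliable universal choice extractor for converting free-form model outputs into predefined choices?
RQ4: What are the strengths and weaknesses of current LVLMs across the 20 leaf abilities?
RQ5: How do different model architectures and data strategies affect performance on each ability dimension?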

Key findings

Vision-language model	Overall	LR	AR	RR	FP-S	FP-C	CP
OpenFlamingo	4.3%	6.7%	11.4%	3.3%	2.5%	1.6%	1.5%
OpenFlamingo v2	5.7%	11.4%	12.8%	1.4%	5.5%	0.8%	4.0%
MMGPT	16.0%	1.1%	23.9%	20.7%	18.3%	5.2%	18.2%
MiniGPT-4	23.0%	13.6%	32.9%	8.9%	28.7%	11.2%	28.3%
InstructBLIP	36.0%	14.2%	46.3%	22.6%	37.0%	21.4%	49.0%
VisualGLM	38.1%	10.8%	44.3%	35.7%	43.8%	23.4%	47.3%
LLaVA	38.7%	16.7%	48.3%	30.4%	45.5%	32.4%	40.6%
LLaMA-Adapter	41.2%	11.7%	35.3%	29.6%	47.5%	38.6%	56.4%
μ-G2PT	43.2%	13.3%	38.8%	40.9%	46.5%	38.6%	58.1%
mPLUG-Owl	49.4%	16.7%	53.2%	47.8%	50.2%	40.7%	64.1%
Otter-I	51.4%	32.5%	56.7%	53.9%	46.8%	36.4%	60.6%
Kosmos-2	59.2%	46.7%	55.7%	43.5%	65.6%	47.9%	70.4%
Shikra	58.8%	25.8%	56.7%	58.3%	57.2%	57.9%	75.8%
PandaGPT	33.5%	10.0%	38.8%	23.5%	27.9%	35.2%	48.3%
MiniGPT-4-13B	42.3%	20.8%	50.7%	30.4%	49.5%	26.2%	50.7%
InstructBLIP-13B	44.0%	19.1%	54.2%	34.8%	47.8%	24.8%	56.4%

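MMBench comprises 2,974 data samples across 20 leaf abilities, with a balanced distribution across abilities.
CircularEval substantially reduces bias and yields more robust comparisons among the tested models than VanillaEval.
ChatGPT-based choice extraction achieves high agreement with human judgment (87.0–87.2% for GPT-3.5/GPT-4) and outperforms exact matching when parsing ambiguous outputs.
Across the 14 LVLMs, performance varies widely by ability, and scores drop markedly under CircularEval compared with VanillaEval: the protocol gains robustness at the expense of raw accuracy.
A larger model or a different architecture does not guarantee better instruction following or overall performance. Per-leaf trends reveal specific strengths (e.g., cross-instance perception, logical/relation reasoning).
Dev-set ground truth is public, while test evaluation requires submission to the evaluation server, enabling fair comparison across models.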
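The choice-extraction step, in which an LLM maps a free-form answer to a choice label and a random label is used when parsing fails, can be sketched as below. This is a hedged illustration, not the authors' code: `match_by_rule`, `extract_choice`, and `fake_llm` are hypothetical names, the cheap rule-based first step is an assumption added for illustration, and `llm_extractor` is a stand-in for a real call to a model such as ChatGPT (no real API is invoked here).

```python
import random
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical interface: takes (model_output, options), returns a label or None.
LLMExtractor = Callable[[str, Dict[str, str]], Optional[str]]


def match_by_rule(output: str, options: Dict[str, str]) -> Optional[str]:
    """Assumed cheap first step: accept only unambiguous matches."""
    text = output.strip()
    # Case 1: the output is a bare label such as "B", "(B)", or "B."
    m = re.fullmatch(r"\(?([A-D])\)?[.:]?", text)
    if m and m.group(1) in options:
        return m.group(1)
    # Case 2: the text of exactly one option appears in the output.
    hits = [label for label, content in options.items()
            if content.lower() in text.lower()]
    return hits[0] if len(hits) == 1 else None


def extract_choice(output: str, options: Dict[str, str],
                   llm_extractor: LLMExtractor,
                   rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Return (label, source), where source is 'rule', 'llm', or 'random'."""
    label = match_by_rule(output, options)
    if label is not None:
        return label, "rule"
    label = llm_extractor(output, options)  # stand-in for a ChatGPT call
    if label in options:
        return label, "llm"
    # Fallback when parsing fails: pick a random label, as the review describes.
    rng = rng or random.Random(0)
    return rng.choice(sorted(options)), "random"


def fake_llm(output: str, options: Dict[str, str]) -> Optional[str]:
    """Toy stand-in for the LLM: recognizes one paraphrase, otherwise gives up."""
    return "C" if "feathers" in output else None
```

Returning the source alongside the label makes it easy to count how often each stage was needed, which is the kind of statistic one would compare against human judgment.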

This review was generated by AI and checked by a human editor.