QUICK REVIEW

[Paper Review] MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Dongping Chen, Ruoxi Chen|arXiv (Cornell University)|Feb 7, 2024

Natural Language Processing Techniques | Citations: 5

One-Sentence Summary

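The paper proposes MLLM-as-a-Judge, a benchmark for assessing how well multimodal LLMs act as judges across modalities. It covers three tasks (Scoring Evaluation, Pair Comparison, and Batch Ranking) and uses 4,414 image-instruction pairs drawn from 14 datasets. GPT-4V aligns most closely with human judgments, while other models show biases and hallucinations.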

ABSTRACT

Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/.

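Motivation and Objectives

Motivate and define a multimodal judging benchmark for MLLMs that is aligned with human preferences.
Evaluate mainstream MLLMs on three judging tasks spanning diverse modalities, using carefully curated datasets.
Characterize bias, hallucination, and consistency problems in MLLM judgments.
Provide a dataset and insights to guide future improvements to MLLM-as-a-Judge.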

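Proposed Method

Collect image-instruction pairs from 14 datasets, forming a set of 4,414 pairs.
Generate responses from six mainstream MLLMs to form the set of responses to be judged.
Compare MLLM judgments against human annotations on three tasks (Scoring Evaluation, Pair Comparison, and Batch Ranking).
Analyze agreement with human judgments using Pearson correlation, accuracy/F1/recall, and normalized Levenshtein distance.
Investigate biases (egocentric, position, and length/verbosity) and hallucinations in MLLM judgments.
Evaluate how chain-of-thought (CoT) prompting and visual descriptions affect judging performance.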
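
The agreement metrics named in the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' evaluation code. The review does not say which metric goes with which task, so the pairing below (Pearson for scores, accuracy for pairwise verdicts, normalized Levenshtein for rankings) is an assumption, as are the function names, the encoding of a ranking as a string such as "ABCD", and the choice to normalize by the longer string's length.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of pairwise verdicts that match the human verdict."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For example, `normalized_levenshtein("ABCD", "ABDC")` compares a hypothetical model ranking with a human ranking of four responses; lower values mean closer agreement, whereas for Pearson correlation and accuracy higher values mean closer agreement.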

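Experimental Results

Research Questions

RQ1: Can MLLMs serve effectively as judges in the multimodal domain, and how closely do their evaluations match human preferences?
RQ2: How well do MLLMs align with human judgments on each task: Scoring Evaluation, Pair Comparison, and Batch Ranking?
RQ3: Which biases and hallucinations affect MLLM judgments, and can prompting strategies mitigate them?
RQ4: Does providing visual input (rather than a description) improve multimodal judging performance?
RQ5: Does a multi-step CoT approach improve or degrade judging performance?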

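Key Findings

GPT-4V consistently achieves the closest alignment with human annotations across settings, outperforming the other MLLMs.
MLLMs agree well with human preferences in Pair Comparison but diverge markedly in Scoring Evaluation and Batch Ranking.
Hallucinations and biases (egocentric, position, and length) are widespread and undermine the reliability of judgments, most notably in Batch Ranking.
Combining visual input with visual descriptions substantially improves judging performance, at times surpassing the baseline without visual information.
Three-step CoT reduces hallucinations but does not consistently improve alignment with human preferences, and in some cases degrades judgment quality.
There is evidence of a scaling effect for certain models: larger LLMs perform better on certain tasks.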
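
One common way to probe the position bias mentioned in the findings is to ask a judge the same pairwise question twice with the two responses swapped and check whether the verdict swaps accordingly. The sketch below illustrates that idea only. The review does not describe the paper's actual protocol, so the `judge` callable, the verdict labels "first", "second", and "tie", and the `position_consistency` function are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical verdict labels: which of the two shown responses wins.
Verdict = str  # "first", "second", or "tie"


def flip(verdict: Verdict) -> Verdict:
    """Verdict an order-independent judge should give after swapping."""
    return {"first": "second", "second": "first", "tie": "tie"}[verdict]


def position_consistency(
    judge: Callable[[str, str], Verdict],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of pairs whose verdict survives swapping the order."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one response pair")
    consistent = 0
    for a, b in pairs:
        original = judge(a, b)
        swapped = judge(b, a)
        if swapped == flip(original):
            consistent += 1
    return consistent / len(pairs)


# Toy judges standing in for a real model call (assumptions, not real APIs).
def always_first(a: str, b: str) -> Verdict:
    """Maximally position-biased: prefers whatever is shown first."""
    return "first"


def longer_wins(a: str, b: str) -> Verdict:
    """Length-biased but order-independent."""
    if len(a) > len(b):
        return "first"
    if len(a) < len(b):
        return "second"
    return "tie"
```

A score near 1.0 suggests verdicts do not depend on presentation order. Note that `longer_wins` would pass this check while still being length-biased, so a check like this isolates position effects only.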


This review was generated by AI and checked by a human editor.