QUICK REVIEW

[Paper Review] Style Over Substance: Evaluation Biases for Large Language Models

Minghao Wu, Alham Fikri Aji|arXiv (Cornell University)|Jul 6, 2023

Topic Modeling|Citations: 9

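One-line summary

This paper exposes biases in how human and LLM judges evaluate LLM outputs and proposes the Multi-Elo Rating System (MERS), which rates text along separate dimensions. MERS improves the factual accuracy of GPT-4's evaluations, but its effect on crowd-sourced evaluations is limited.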

ABSTRACT

As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Ranking the relative performance of LLMs based on Elo ratings, according to human judgment, is gaining more popularity. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed machine-generated answers. Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contained grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System (MERS). Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced-based evaluations, indicating the need for further investigation.

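Motivation and objectives

Investigate how crowd-sourced annotators, expert annotators, and LLM judges evaluate the outputs of different models.
Identify biases in the evaluation process, including length bias and order effects.
Propose and validate a multi-dimensional evaluation framework (MERS) built on Elo-style ratings.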

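Proposed method

Use GPT-4 prompts to generate 12 model settings that vary answer length, language proficiency, and factual accuracy.
Collect pairwise judgments on 40 questions (5,280 comparisons) from crowd-sourced annotators, expert annotators, GPT-4, and Claude-1.
Compute Elo ratings under each judge type and analyze inter-annotator agreement (Cohen's kappa).
Rate outputs independently on three dimensions (Accuracy, Helpfulness, and Language) with the Multi-Elo Rating System (MERS).
Compare single-score and multi-dimensional evaluation, and analyze sources of bias such as length, order, and fact-checking.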
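
The idea of keeping one Elo table per dimension can be illustrated with a minimal Python sketch. It is not the authors' implementation: the standard Elo update is used as-is, and the K-factor of 32, the initial rating of 1000, the function names, and the model names in the usage example are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Dimension names follow the review; everything else here is hypothetical.
DIMENSIONS = ("accuracy", "helpfulness", "language")


def expected_score(r_a, r_b):
    """Standard Elo expectation that A beats B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b); score_a is 1.0 win, 0.5 tie, 0.0 loss for A."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def multi_elo(comparisons, k=32.0, initial=1000.0):
    """Keep one independent Elo table per dimension.

    comparisons: iterable of (model_a, model_b, verdicts), where verdicts
    maps a dimension name to A's score on that dimension.
    k and initial are illustrative defaults, not values from the paper.
    """
    ratings = {dim: defaultdict(lambda: initial) for dim in DIMENSIONS}
    for model_a, model_b, verdicts in comparisons:
        for dim, score_a in verdicts.items():
            table = ratings[dim]
            table[model_a], table[model_b] = update(
                table[model_a], table[model_b], score_a, k
            )
    return {dim: dict(table) for dim, table in ratings.items()}


# Hypothetical judgment: a long answer with factual errors vs. a short correct one.
example = [
    (
        "long_with_errors",
        "short_correct",
        {"accuracy": 0.0, "helpfulness": 1.0, "language": 0.5},
    )
]
result = multi_elo(example)
print(result["accuracy"])     # the short, correct answer gains accuracy rating
print(result["helpfulness"])  # the long answer gains helpfulness rating
```

Because each dimension has its own table, a model can rank high on Helpfulness while ranking low on Accuracy, which a single merged score would hide. Elo ratings also depend on the order in which comparisons are processed, so a real analysis would need to account for that.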

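Experimental results

Research questions

RQ1: Do humans and LLMs evaluate model outputs equivalently, or are there systematic evaluation biases?
RQ2: How do answer length, language proficiency, and factual accuracy affect the judgments of each judge type?
RQ3: Does separating evaluation into multiple dimensions (MERS) improve the quality and reliability of LLM-based evaluation?
RQ4: Are crowd-sourced evaluations more reliable for LLM benchmarking than expert or LLM judges?
RQ5: Does a multi-dimensional Elo-based framework reflect true output quality better than a single overall score?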

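Key findings

Human judges (crowd-sourced and expert) are more hesitant than LLM judges and tend to fact-check insufficiently, whereas LLMs are more decisive and prefer longer answers.
Both humans and LLMs favor longer text, which can put concise, factually accurate outputs at a disadvantage.
Crowd-sourced annotators are less decisive and weaker at fact-checking; experts are stronger on both counts but not perfect. LLM judges notice errors, but not consistently.
A single unified score does not adequately capture output quality; per-dimension evaluation brings out the nuances in Accuracy, Helpfulness, and Language.
MERS markedly improves the factual accuracy of GPT-4-based evaluation, while the improvement for crowd-sourced evaluation is limited.
Inter-annotator agreement is moderate between GPT-4 and Claude-1 but only slight for the other pairings, reflecting the variability of human judgment.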
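
For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa for two judges from scratch. The labels "moderate" and "slight" are assumed here to follow the commonly used Landis and Koch bands; the review does not state which scale was used. The function names and the sample verdicts are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two judges' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Map kappa to the Landis and Koch descriptive bands (an assumed scale)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical pairwise verdicts ("A" wins, "B" wins, or "tie") from two judges.
judge_1 = ["A", "A", "B", "tie", "B", "A"]
judge_2 = ["A", "B", "B", "tie", "B", "A"]
kappa = cohens_kappa(judge_1, judge_2)
print(round(kappa, 3), landis_koch(kappa))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one verdict (for example "tie") dominates a judge's labels.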


This review was generated by AI and checked by a human editor.