QUICK REVIEW

[Paper Review] Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Hui Wei, Shenghua He|arXiv (Cornell University)|Aug 23, 2024

Artificial Intelligence in Law | Cited by 6

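One-Line Summary

This paper develops explainable metrics for evaluating LLMs as judges on alignment tasks, analyzes the impact of diverse prompt templates, and provides a framework validated on the TL;DR Summarization and HH-RLHF-Helpfulness datasets.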

ABSTRACT

LLM-as-a-Judge has been widely applied to evaluate and compare different LLM alignment approaches (e.g., RLHF and DPO). However, concerns regarding its reliability have emerged, due to LLM judges' biases and inconsistent decision-making. Previous research has developed evaluation frameworks to assess reliability of LLM judges and their alignment with human preferences. However, the employed evaluation metrics often lack adequate explainability and fail to address LLM internal inconsistency. Additionally, existing studies inadequately explore the impact of various prompt templates when applying LLM-as-a-Judge methods, leading to potentially inconsistent comparisons between different alignment algorithms. In this work, we systematically evaluate LLM-as-a-Judge on alignment tasks by defining more theoretically interpretable evaluation metrics and explicitly mitigating LLM internal inconsistency from reliability metrics. We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges, which facilitates practitioners to choose LLM judges for alignment tasks. In the experiments, we examine effects of diverse prompt templates on LLM-judge reliability and also demonstrate our developed framework by comparing various LLM judges on two common alignment datasets (i.e., TL;DR Summarization and HH-RLHF-Helpfulness). Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.

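Motivation and Objectives

Improve the interpretability of evaluation metrics for LLM judges by formalizing accuracy, flipping noise, position bias, and length bias.
Separate the reliability of LLM judges from their internal inconsistency so that evaluations become more trustworthy.
Assess how different prompt templates affect LLM-judge reliability and alignment with human preferences.
Provide a general framework for evaluating, comparing, and visualizing LLM judges across models and templates.
Offer guidance, based on systematic rankings, for choosing an LLM judge suited to a specific alignment task.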

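Proposed Method

Define and compute the accuracy metrics Acc_both and Acc_random within a unified framework that accounts for data with swapped response order.
Model flipping noise and de-noise the metrics, separating LLM self-inconsistency from biases such as position bias and length bias.
Quantify position bias as the difference in agreement with human preferences when the response order is swapped, and compute a de-noised estimate.
Quantify length bias as the relative difference between the tendency to prefer longer responses and the tendency to prefer shorter ones, with flipping noise removed.
Develop an evaluation framework for systematic comparison, comprising data sampling, LLM judging, metric computation, and visualization.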
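
The metrics listed above can be illustrated with a minimal Python sketch. This review does not reproduce the paper's exact formulas, so the definitions below are assumptions: Acc_both counts a sample as correct only if the judge agrees with the human label in both response orders, Acc_random is the expected accuracy when one order is chosen at random, position bias is the accuracy gap between showing the human-preferred response first versus second, and length bias is the normalized gap between picks of the longer and the shorter response. The `Sample` type and `judge_metrics` function are hypothetical names, not part of the authors' framework.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    human: int      # index (0 or 1) of the response preferred by humans
    pick_orig: int  # judge's pick when response 0 is shown first
    pick_swap: int  # judge's pick (same indexing) when response 1 is shown first
    len0: int       # length of response 0
    len1: int       # length of response 1


def judge_metrics(samples: List[Sample]) -> Dict[str, float]:
    """Illustrative (assumed) definitions of the four judge metrics."""
    n = len(samples)
    both = rand = first_ok = second_ok = 0.0
    longer = shorter = 0
    for s in samples:
        ok_orig = s.pick_orig == s.human
        ok_swap = s.pick_swap == s.human
        both += ok_orig and ok_swap           # correct in both orders
        rand += 0.5 * ok_orig + 0.5 * ok_swap  # expected over a random order
        # Response 0 is first in the original order, response 1 when swapped.
        if s.human == 0:
            first_ok += ok_orig
            second_ok += ok_swap
        else:
            first_ok += ok_swap
            second_ok += ok_orig
        # Length preference is only defined when the lengths differ.
        if s.len0 != s.len1:
            long_idx = 0 if s.len0 > s.len1 else 1
            for pick in (s.pick_orig, s.pick_swap):
                if pick == long_idx:
                    longer += 1
                else:
                    shorter += 1
    total = longer + shorter
    return {
        "acc_both": both / n,
        "acc_random": rand / n,
        "position_bias": (first_ok - second_ok) / n,
        "length_bias": (longer - shorter) / total if total else 0.0,
    }


# Made-up toy data, for illustration only.
toy = [
    Sample(human=0, pick_orig=0, pick_swap=0, len0=10, len1=5),
    Sample(human=1, pick_orig=0, pick_swap=1, len0=8, len1=12),
    Sample(human=1, pick_orig=1, pick_swap=1, len0=4, len1=9),
    Sample(human=0, pick_orig=1, pick_swap=1, len0=7, len1=3),
]
metrics = judge_metrics(toy)
```

Under these assumed definitions, a positive position bias means the judge is more accurate when the human-preferred response appears first, and a positive length bias means it picks the longer response more often than the shorter one.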

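Experimental Results

Research Questions

RQ1: How reliable are LLM judges as substitutes for human evaluators across different prompts and models?
RQ2: How do prompt templates affect the accuracy, position bias, and length bias of LLM judges?
RQ3: Can flipping noise be separated from true bias to obtain more interpretable reliability metrics for LLM judges?
RQ4: How well do LLM judges align with human preferences on common datasets such as TL;DR and HH-RLHF-Helpfulness?
RQ5: Which LLM judges (model + template) perform best under Acc_both on a given dataset, and how should they be ranked?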
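
To make RQ3 concrete, here is a minimal sketch of one way flipping noise could be separated from an observed rate. It assumes a symmetric noise model in which the judge flips its decision with probability q independently of the input, so that observed = true * (1 - q) + (1 - true) * q. Both this model and the majority-vote estimate of q are assumptions made for illustration; they are not necessarily the paper's exact procedure. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def estimate_flip_rate(repeated_picks: Sequence[Sequence[int]]) -> float:
    """Share of repeated judgments that disagree with the per-input majority.

    Each inner sequence holds the judge's picks for the same input queried
    several times. Treating the majority as the noise-free decision is an
    assumption and tends to underestimate q when few repeats are used.
    """
    flips = total = 0
    for picks in repeated_picks:
        majority_count = Counter(picks).most_common(1)[0][1]
        flips += len(picks) - majority_count
        total += len(picks)
    return flips / total


def denoise_rate(observed: float, q: float) -> float:
    """Invert observed = true*(1-q) + (1-true)*q for a flip probability q < 0.5."""
    if not 0.0 <= q < 0.5:
        raise ValueError("q must be in [0, 0.5) for the inversion to be valid")
    return (observed - q) / (1.0 - 2.0 * q)


# Made-up repeated judgments for five inputs, four runs each.
runs = [[0, 0, 0, 1], [1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
q_hat = estimate_flip_rate(runs)
true_acc = denoise_rate(0.7, q_hat)
```

With these toy inputs, two of twenty judgments disagree with their majority, so the estimated flip rate is 0.1 and an observed accuracy of 0.7 corresponds to a de-noised accuracy of 0.75 under the assumed model.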

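Key Findings

Prompt templates have a significant impact on LLM-judge accuracy across datasets.
LLM judges show only mediocre alignment with human evaluators on both the TL;DR and HH-RLHF-Helpfulness data.
Across the tested judges, accuracy and position bias are markedly negatively correlated.
All tested LLM judges are biased toward longer responses, most noticeably in multi-turn conversations.
GPT-4o and GPT-4o-mini generally outperform GPT-3.5-turbo in accuracy, while the effect of prompt templates varies.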

This review was generated by AI and checked by a human editor.