QUICK REVIEW

[Paper Review] Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Yinhong Liu, Han Zhou|arXiv (Cornell University)|Mar 25, 2024

Interpreting and Communication in Healthcare | Cited by 6

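One-line summary

This paper reframes LLM-based evaluation as a pairwise-preference ranking problem and proposes PairS, an uncertainty-guided search method that ranks candidate texts efficiently and aligns better with human judgement.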

ABSTRACT

Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human evaluation, revealing that existing calibration methods aimed at mitigating biases of LLMs are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally. PAIRS achieves state-of-the-art performance on representative evaluation tasks in long-form generations and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PAIRS benefits from calibration using debiased pairwise evaluations.

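Motivation and objectives

Examine why score-based LLM evaluators tend to diverge from human judgement even after calibration.
Propose an RLHF-inspired pairwise-preference evaluation framework to improve alignment.
Develop PairS, a search algorithm that efficiently ranks candidate texts through pairwise comparisons.
Analyze transitivity as a property of LLM evaluators and examine how calibration interacts with pairwise methods.
Demonstrate the scalability and robustness of PairS across summarization and open-ended generation tasks.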

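Proposed method

Systematically analyze how well the likelihood p(y|s) agrees between LLMs and humans, and identify where the two diverge.
Formulate evaluation as a ranking problem and define the joint likelihood L_T of the pairwise comparisons.
Introduce PairS (Pairwise-preference Search) and its PairS-greedy and PairS-beam variants to search the pairwise-comparison space efficiently.
Incorporate an entropy-based, uncertainty-driven pruning mechanism to guide the search toward low-uncertainty comparisons.
Provide a two-stage scaling variant for large evaluation sets, moving from O(N log N) to O(N) behavior after anchoring.
Apply Batch Calibration to PairS, using a uniform human prior and the marginal model prior, to reduce bias.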
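
The idea of ranking through local pairwise comparisons can be illustrated with a merge sort whose comparator is a preference probability. The following is a minimal sketch in the spirit of PairS-greedy, not the authors' implementation: `toy_prefer` and its hidden `scores` table are hypothetical stand-ins for an LLM call that returns P(a is preferred over b), and `binary_entropy` is one assumed way to measure how uncertain a single comparison is.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical comparator type: returns P(a is preferred over b) in [0, 1].
# In a real setup this would wrap an LLM call; here it is an assumption.
PrefFn = Callable[[str, str], float]
LogEntry = Tuple[str, str, float, float]


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) preference; 1.0 is maximally uncertain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def merge(left: List[str], right: List[str], prefer: PrefFn,
          log: List[LogEntry]) -> List[str]:
    """Merge two best-first lists, taking the preferred head at each step."""
    merged: List[str] = []
    i = j = 0
    while i < len(left) and j < len(right):
        p = prefer(left[i], right[j])
        log.append((left[i], right[j], p, binary_entropy(p)))
        if p >= 0.5:  # greedy: always follow the more likely outcome
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def rank_greedy(items: Sequence[str], prefer: PrefFn,
                log: List[LogEntry]) -> List[str]:
    """Merge-sort ranking (best first) using O(N log N) pairwise comparisons."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(rank_greedy(items[:mid], prefer, log),
                 rank_greedy(items[mid:], prefer, log), prefer, log)


# Toy stand-in for the LLM: a logistic function of a hidden quality gap.
scores = {"s1": 0.2, "s2": 2.5, "s3": 1.1, "s4": -0.7}


def toy_prefer(a: str, b: str) -> float:
    return 1.0 / (1.0 + math.exp(-(scores[a] - scores[b])))


log: List[LogEntry] = []
ranking = rank_greedy(list(scores), toy_prefer, log)
print(ranking)
print(max(log, key=lambda entry: entry[3]))  # the most uncertain comparison
```

The logged entropies show which comparisons are close calls. A beam variant could branch on those comparisons rather than committing to the greedy choice, which is the intuition behind uncertainty-guided search.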

Figure 1: Illustration of RLHF (left), direct scoring with LLM evaluators (middle), and pairwise preference search (PairS; right). Pairwise preference data have been utilized to train the reward model to align the LLM in RLHF. Leveraging this idea, PairS reframes the traditional scoring-based eval

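Experimental results

Research questions

RQ1: Can pairwise preference data yield evaluations that agree with human judgement better than direct scoring?
RQ2: How can uncertainty-guided search efficiently estimate the maximum-likelihood ranking in the pairwise-comparison space?
RQ3: How do transitivity and calibration affect evaluator performance and reliability?
RQ4: How well does PairS scale to large evaluation sets and to real-world tasks such as summarization and story generation?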

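Key findings

Calibrating score-based LLM evaluators does not bring them fully in line with human judgement; the misalignment persists after calibration.
PairS substantially improves alignment with human judgement on summarization and story-generation tasks, often approaching state-of-the-art baselines in the zero-shot setting.
PairS-beam generally yields more robust rankings than PairS-greedy, with higher correlations and smaller standard errors, especially for models with low transitivity.
Uncertainty-guided pruning shrinks the search space, keeping evaluation scalable while preserving ranking quality.
The transitivity of an LLM evaluator correlates with evaluation quality: stronger transitivity leads to more efficient and more reliable rankings. GPT-3.5 behaved notably consistently in the experiments.
Calibration benefits PairS, but the gain varies by model (smaller models often gain more).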
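
Transitivity can be made concrete with a simple count over triples. The sketch below is one assumed way to quantify it, not necessarily the measure used in the paper: it reports the fraction of unordered triples whose pairwise preferences contain no cycle. The `prefers` callable is a hypothetical, antisymmetric stand-in for an LLM's hard preference decision.

```python
from itertools import combinations
from typing import Callable, Sequence

# Hypothetical hard comparator: True if a is preferred over b.
# Assumed antisymmetric, i.e. prefers(a, b) == (not prefers(b, a)) for a != b.
HardPref = Callable[[str, str], bool]


def transitivity_rate(items: Sequence[str], prefers: HardPref) -> float:
    """Fraction of unordered triples whose preferences form no cycle."""
    triples = list(combinations(items, 3))
    if not triples:
        return 1.0
    acyclic = 0
    for a, b, c in triples:
        ab, bc, ca = prefers(a, b), prefers(b, c), prefers(c, a)
        # A cycle exists exactly when all three edges point the same way
        # around the triangle (a>b>c>a, or the reverse direction).
        if not (ab == bc == ca):
            acyclic += 1
    return acyclic / len(triples)


# A score-based comparator is perfectly transitive.
quality = {"t1": 3.0, "t2": 1.0, "t3": 2.0, "t4": 0.5}
consistent_rate = transitivity_rate(list(quality),
                                    lambda a, b: quality[a] > quality[b])

# A rock-paper-scissors comparator is fully cyclic.
beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
cyclic_rate = transitivity_rate(["rock", "paper", "scissors"],
                                lambda a, b: (a, b) in beats)

print(consistent_rate, cyclic_rate)
```

An evaluator with a rate well below 1.0 would give a sort-based ranking that depends on the order in which comparisons happen to be made, which is consistent with the finding that beam search helps most for less transitive models.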

Figure 2: LLM evaluations are misaligned with human judgements. The score histograms on evaluating the coherence in HANNA (Chhun et al., 2022) and SummEval (Fabbri et al., 2021). We present the scores from ground human evaluations, LLMs, and LLMs after calibrations. The histograms can be interpre

This review was generated by AI and checked by a human editor.