QUICK REVIEW

[Paper Review] Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Yinhong Liu, Han Zhou|arXiv (Cornell University)|Mar 25, 2024

Interpreting and Communication in Healthcare|Cited by 6

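One-Sentence Summary

This paper reframes LLM-based evaluation as a pairwise preference ranking problem and proposes PairS, an uncertainty-guided search method that ranks candidate texts efficiently and aligns better with human judgement.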

ABSTRACT

Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human evaluation, revealing that existing calibration methods aimed at mitigating biases of LLMs are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally. PAIRS achieves state-of-the-art performance on representative evaluation tasks in long-form generations and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PAIRS benefits from calibration using debiased pairwise evaluations.

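Motivation and Objectives

Assess why score-based LLM evaluators remain misaligned with human judgement even after calibration.
Propose an RLHF-inspired pairwise preference evaluation framework to improve alignment.
Develop PairS, an uncertainty-guided search algorithm that ranks candidate texts efficiently through pairwise comparisons.
Analyze the transitivity of pairwise preferences as a property of LLM evaluators, and how calibration interacts with pairwise methods.
Demonstrate the scalability and robustness of PairS on summarization and open-ended generation tasks.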

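Proposed Method

Systematically analyze the effectiveness of calibration and identify the mismatch in the likelihood p(y|s) between LLMs and humans.
Formalize evaluation as a ranking problem and define a transitive likelihood L_T over pairwise comparisons.
Introduce PairS (Pairwise-preference Search) with its PairS-greedy and PairS-beam variants to search the space of pairwise comparisons efficiently.
Incorporate an uncertainty-based pruning mechanism that uses entropy to guide the pruning of low-uncertainty comparisons.
Provide a two-stage scaling variant that, after anchoring, reduces the cost on large evaluation sets from O(N log N) to O(N) comparisons.
Apply batch calibration to PairS, using a uniform human prior and a marginal model prior to reduce bias.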
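
The idea of ranking candidates globally from local pairwise comparisons can be illustrated with a small sketch. The code below assumes a merge-sort-style procedure, since the summary does not spell out the exact search steps. It is not the authors' implementation. The names `prefer`, `toy_prefer`, and `QUALITY` are hypothetical: `prefer(a, b)` stands in for an LLM call returning the probability that `a` is better than `b`, and here it is replaced by a toy logistic model over made-up quality scores. The sketch also logs the entropy of each comparison as an uncertainty signal, but it does not implement pruning or beam search.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical comparator type: returns P(a is better than b) in [0, 1].
PreferenceFn = Callable[[str, str], float]
# One log entry per comparison: (a, b, probability, entropy in bits).
ComparisonLog = List[Tuple[str, str, float, float]]


def binary_entropy(p: float) -> float:
    """Entropy in bits of a pairwise preference probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def merge(left: List[str], right: List[str],
          prefer: PreferenceFn, log: ComparisonLog) -> List[str]:
    """Merge two best-first lists using one pairwise comparison per step."""
    merged: List[str] = []
    i = j = 0
    while i < len(left) and j < len(right):
        p = prefer(left[i], right[j])
        log.append((left[i], right[j], p, binary_entropy(p)))
        if p >= 0.5:  # greedy choice: trust the more likely winner
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def rank(candidates: Sequence[str], prefer: PreferenceFn,
         log: ComparisonLog) -> List[str]:
    """Return candidates ordered best to worst via merge sort."""
    if len(candidates) <= 1:
        return list(candidates)
    mid = len(candidates) // 2
    return merge(rank(candidates[:mid], prefer, log),
                 rank(candidates[mid:], prefer, log), prefer, log)


# Toy stand-in for an LLM evaluator (assumption: hidden quality scores).
QUALITY = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}


def toy_prefer(x: str, y: str) -> float:
    """Logistic preference from the made-up quality gap."""
    return 1.0 / (1.0 + math.exp(-(QUALITY[x] - QUALITY[y])))


comparisons: ComparisonLog = []
ranking = rank(["c", "a", "d", "b"], toy_prefer, comparisons)
```

With a real evaluator, comparisons whose logged entropy is close to 1 bit are the uncertain ones. A beam-style variant could keep both outcomes of such comparisons alive instead of committing greedily.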

Figure 1: Illustration of RLHF (left), direct scoring with LLM evaluators (middle), and pairwise preference search (PairS; right). Pairwise preference data have been utilized to train the reward model to align the LLM in RLHF. Leveraging this idea, PairS reframes the traditional scoring-based eval

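Experimental Results

Research Questions

RQ1: Can pairwise preference data bring LLM evaluation closer to human judgement than direct scoring?
RQ2: How does uncertainty-guided search efficiently estimate the maximum-likelihood ranking in the space of pairwise comparisons?
RQ3: How do transitivity and calibration affect the performance and reliability of LLM evaluators?
RQ4: How well does PairS scale to large evaluation sets and to realistic tasks such as summarization and story generation?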

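Key Findings

Calibrating score-based LLM evaluators does not fully align them with human judgement; misalignment remains after calibration.
PairS substantially improves agreement with human judgement on summarization and story generation tasks, often approaching or matching the latest baselines in the zero-shot setting.
PairS-beam generally yields more robust rankings than PairS-greedy (higher correlation, lower standard error), especially for models with lower transitivity.
Uncertainty-based pruning shrinks the search space while preserving ranking quality, which enables scalable evaluation.
The transitivity of an LLM evaluator correlates with evaluation quality: the stronger the transitivity, the more efficient and reliable the ranking. GPT-3.5 showed notable consistency in the experiments.
Calibration helps PairS, but the benefit varies by model (smaller models gain more than high-performing ones).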
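
To make the notion of evaluator transitivity concrete, here is a minimal sketch of one possible way to quantify it: the share of item triples whose hard pairwise preferences contain no cycle. This metric and the function name `transitivity_rate` are assumptions chosen for illustration, and the paper may measure transitivity differently. The input `pref[i][j]` is assumed to be the evaluator's probability that item `i` is better than item `j`.

```python
from itertools import combinations
from typing import Sequence


def transitivity_rate(pref: Sequence[Sequence[float]]) -> float:
    """Fraction of item triples whose hard preferences form no cycle.

    A pair with probability exactly 0.5 counts as undecided, so a triple
    containing such a pair is never counted as cyclic.
    """
    n = len(pref)
    triples = list(combinations(range(n), 3))
    if not triples:
        return 1.0

    def beats(i: int, j: int) -> bool:
        return pref[i][j] > 0.5

    cyclic = 0
    for a, b, c in triples:
        forward = beats(a, b) and beats(b, c) and beats(c, a)
        backward = beats(b, a) and beats(c, b) and beats(a, c)
        if forward or backward:
            cyclic += 1
    return 1.0 - cyclic / len(triples)


# Made-up example: items 0, 1, 2 form a cycle; item 3 loses to everyone.
example = [
    [0.5, 0.9, 0.1, 0.9],
    [0.1, 0.5, 0.9, 0.9],
    [0.9, 0.1, 0.5, 0.9],
    [0.1, 0.1, 0.1, 0.5],
]
rate = transitivity_rate(example)
```

In this made-up matrix one of the four triples is cyclic, so the rate is 0.75. An evaluator with a rate near 1 gives a sorting-based ranker a consistent order to recover, which fits the finding that higher transitivity goes with more reliable rankings.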

Figure 2: LLM evaluations are misaligned with human judgements. The score histograms on evaluating the coherence in HANNA (Chhun et al., 2022) and SummEval (Fabbri et al., 2021). We present the scores from ground human evaluations, LLMs, and LLMs after calibrations. The histograms can be interpre


This review was generated by AI and checked by human editors.