QUICK REVIEW

[Paper Review] Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Hui Wei, Shenghua He|arXiv (Cornell University)|Aug 23, 2024

Artificial Intelligence in Law|Cited by 6

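ONE-SENTENCE SUMMARY

This paper develops explainable metrics for evaluating LLMs as judges in alignment tasks, analyzes the impact of diverse prompt templates, and provides a framework validated on the TL;DR Summarization and HH-RLHF-Helpfulness datasets.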

ABSTRACT

LLM-as-a-Judge has been widely applied to evaluate and compare different LLM alignment approaches (e.g., RLHF and DPO). However, concerns regarding its reliability have emerged, due to LLM judges' biases and inconsistent decision-making. Previous research has developed evaluation frameworks to assess reliability of LLM judges and their alignment with human preferences. However, the employed evaluation metrics often lack adequate explainability and fail to address LLM internal inconsistency. Additionally, existing studies inadequately explore the impact of various prompt templates when applying LLM-as-a-Judge methods, leading to potentially inconsistent comparisons between different alignment algorithms. In this work, we systematically evaluate LLM-as-a-Judge on alignment tasks by defining more theoretically interpretable evaluation metrics and explicitly mitigating LLM internal inconsistency from reliability metrics. We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges, which facilitates practitioners to choose LLM judges for alignment tasks. In the experiments, we examine effects of diverse prompt templates on LLM-judge reliability and also demonstrate our developed framework by comparing various LLM judges on two common alignment datasets (i.e., TL;DR Summarization and HH-RLHF-Helpfulness). Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.

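RESEARCH MOTIVATION AND OBJECTIVES

Improve the explainability of evaluation metrics for LLM judges by formalizing accuracy, flipping noise, position bias, and length bias.
Separate the reliability of LLM judges from their internal inconsistency so that reliability estimates are more trustworthy.
Assess how different prompt templates affect LLM judges' reliability and their agreement with human preferences.
Provide a general framework for evaluating, comparing, and visualizing LLM judges across models and templates.
Offer guidance, based on systematic ranking, for selecting suitable LLM judges for specific alignment tasks.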

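PROPOSED METHOD

Define and compute the accuracy metrics Acc_both and Acc_random within a unified framework that accounts for swapped response orders in the data.
Model and remove flipping noise in order to separate LLM self-inconsistency from biases such as position bias and length bias.
Quantify position bias as the difference in alignment when the response order is swapped, and compute a de-noised estimate.
Quantify length bias as the judge's tendency to prefer longer over shorter responses, with flipping noise removed.
Build an evaluation framework comprising data sampling, LLM judging, metric computation, and visualization for systematic comparison.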
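
The metrics listed above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The record layout, the function names, and the symmetric flip-noise model (each verdict flips independently with probability q < 0.5) are assumptions made here for illustration. Acc_both is read as counting a pair as correct only when the judge agrees with the human label under both response orders, and Acc_random as the expected accuracy when one of the two orders is picked at random. De-noising is shown only for single-order accuracy and for position bias.

```python
import math
from dataclasses import dataclass


@dataclass
class PairVerdict:
    """Judge outcomes for one (chosen, rejected) pair; hypothetical record layout."""
    correct_chosen_first: bool   # judge picked the human-chosen response when it was shown first
    correct_chosen_second: bool  # judge picked the human-chosen response when it was shown second


def acc_both(records):
    """Fraction of pairs judged correctly under both response orders."""
    hits = sum(r.correct_chosen_first and r.correct_chosen_second for r in records)
    return hits / len(records)


def acc_random(records):
    """Expected accuracy when one of the two orders is chosen uniformly at random."""
    total = sum(0.5 * r.correct_chosen_first + 0.5 * r.correct_chosen_second
                for r in records)
    return total / len(records)


def _check_flip_rate(flip_rate):
    if not 0.0 <= flip_rate < 0.5:
        raise ValueError("flip_rate must be in [0, 0.5)")


def estimate_flip_rate(disagreement_rate):
    """Assumed model: two repeats of the same prompt disagree with probability
    d = 2q(1 - q). Solve for the root q < 0.5."""
    if not 0.0 <= disagreement_rate <= 0.5:
        raise ValueError("disagreement_rate must be in [0, 0.5]")
    return (1.0 - math.sqrt(1.0 - 2.0 * disagreement_rate)) / 2.0


def denoise_accuracy(observed, flip_rate):
    """Invert observed = p(1 - q) + (1 - p)q to recover the noise-free accuracy p."""
    _check_flip_rate(flip_rate)
    return (observed - flip_rate) / (1.0 - 2.0 * flip_rate)


def position_bias(records, flip_rate=0.0):
    """Accuracy with the chosen response first minus accuracy with it second.

    Positive values mean the judge favors the first position. Under symmetric
    flip noise the observed gap shrinks by a factor (1 - 2q), so dividing by
    that factor gives the de-noised estimate.
    """
    _check_flip_rate(flip_rate)
    first = sum(r.correct_chosen_first for r in records) / len(records)
    second = sum(r.correct_chosen_second for r in records) / len(records)
    return (first - second) / (1.0 - 2.0 * flip_rate)
```

Acc_both is the stricter of the two accuracy metrics, since a judge that is right in only one order earns no credit for that pair, whereas Acc_random gives it half credit.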

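EXPERIMENTAL RESULTS

Research Questions

RQ1: How reliable are LLM judges as proxies for human evaluators in alignment tasks, across different prompts and models?
RQ2: How do prompt templates affect the accuracy, position bias, and length bias of LLM judges?
RQ3: Can flipping noise be separated from genuine bias to obtain more interpretable reliability metrics for LLM judges?
RQ4: How well do LLM judges align with human preferences on common datasets such as TL;DR Summarization and HH-RLHF-Helpfulness?
RQ5: On a given dataset, which LLM judges (model plus template) achieve the best Acc_both, and how should they be ranked?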

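Key Findings

Prompt templates have a significant impact on LLM-judge accuracy across datasets.
On the TL;DR Summarization and HH-RLHF-Helpfulness data, alignment between LLM judges and human evaluators is only mediocre.
Among the tested judges, accuracy and position bias show a significant negative correlation.
All tested LLM judges generally prefer longer responses in multi-turn conversations.
GPT-4o and GPT-4o-mini generally outperform GPT-3.5-turbo in accuracy, with differing prompt-template effects.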
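
A preference for longer responses, as reported above, could be measured along the following lines. This is a hedged sketch rather than the paper's definition: it assumes length bias is read as the gap in judge accuracy between pairs where the human-chosen response is the longer one and pairs where it is the shorter one, and it assumes the same symmetric flip-noise model with flip probability q < 0.5. The tuple layout and the function name are hypothetical.

```python
def length_bias(pairs, flip_rate=0.0):
    """Accuracy gap between 'chosen is longer' and 'chosen is shorter' pairs.

    Each pair is (chosen_len, rejected_len, judge_picked_chosen).
    Pairs with equal lengths are skipped. Positive values mean the judge
    agrees with humans more often when the preferred response is longer,
    which is the pattern expected from a judge that favors length.
    """
    if not 0.0 <= flip_rate < 0.5:
        raise ValueError("flip_rate must be in [0, 0.5)")
    longer = [picked for chosen, rejected, picked in pairs if chosen > rejected]
    shorter = [picked for chosen, rejected, picked in pairs if chosen < rejected]
    if not longer or not shorter:
        raise ValueError("need at least one pair in each length group")
    gap = sum(longer) / len(longer) - sum(shorter) / len(shorter)
    # Symmetric flip noise scales the observed gap by (1 - 2q); undo that scaling.
    return gap / (1.0 - 2.0 * flip_rate)


# Toy data, invented purely to exercise the function.
toy_pairs = [
    (120, 80, True),
    (200, 50, True),
    (40, 90, False),
    (60, 100, True),
]
print(length_bias(toy_pairs))
```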

This review was generated by AI and reviewed by a human editor.