QUICK REVIEW

[Paper Review] Humans or LLMs as the Judge? A Study on Judgement Biases

Guiming Hardy Chen, Shunian Chen|arXiv (Cornell University)|Feb 16, 2024

Law, Economics, and Judicial Systems | Cited by 9

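One-Sentence Summary

This paper proposes a framework for studying five judgement biases of human and LLM judges when they evaluate open-ended answers, runs a large number of perturbation experiments, and shows that both groups have exploitable biases; the work is backed by an open-source dataset.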

ABSTRACT

Adopting human and large language models (LLM) as judges (a.k.a human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLMs, questioning the reliability of the evaluation results. In this paper, we propose a novel framework that is free from referencing groundtruth annotations for investigating Misinformation Oversight Bias, Gender Bias, Authority Bias and Beauty Bias on LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit these biases to conduct attacks on LLM judges. We hope that our work can notify the community of the bias and vulnerability of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.

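Motivation and Objectives

Advance the study of how robust LLM evaluation is by examining the biases of human and LLM judges on open-ended tasks.
Define and categorize five judge biases (Fallacy Oversight, Authority, Beauty, Verbosity, Positional) and test their effects.
Develop an intervention and post-hoc analysis framework that does not rely on ground-truth references.
Create and release an open-source dataset for open-ended evaluation to facilitate bias analysis.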

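Proposed Method

Design an intervention and post-hoc analysis framework that assesses the five biases without requiring ground truth.
Use GPT-4 to generate question-answer pairs spanning the revised Bloom's Taxonomy, and collect human judgements of semantic quality.
Perturb answers with factual errors, fake references, and rich content to measure vulnerability (Attack Success Rate, ASR).
Evaluate a group of human judges and representative LLMs (e.g., GPT-4, GPT-4-Turbo, Claude-2, PaLM-2, Ernie, LLaMA2) under a control group and an experimental group.
Compute ASR and accuracy to quantify robustness to perturbations and identify biases.
Analyze positional bias and verbosity bias through post-hoc analysis and multi-round evaluation with shuffled answer positions.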
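To make the ASR step concrete, here is a minimal sketch. It assumes one plausible reading of the metric for perturbations meant to make $A_{2}$ look better (fake references, rich content): among samples where the judge did not prefer $A_{2}$ in the control group, the fraction where the judge prefers the perturbed $A_{2}$ in the experimental group. The `Judgement` record and the `attack_success_rate` function are hypothetical names for illustration; the paper's exact definition may differ, especially for factual-error perturbations.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    # Verdict on the unperturbed pair: "A1", "A2", or "tie" (assumed label set).
    control: str
    # Verdict when A2 is replaced by its perturbed version.
    experimental: str


def attack_success_rate(judgements):
    """Fraction of eligible samples whose verdict flips to the perturbed A2.

    A sample is eligible when the judge did NOT prefer A2 in the control
    group (this eligibility rule is an assumption of the sketch).
    """
    eligible = [j for j in judgements if j.control != "A2"]
    if not eligible:
        return 0.0
    flipped = sum(1 for j in eligible if j.experimental == "A2")
    return flipped / len(eligible)


sample = [
    Judgement("A1", "A2"),   # flipped by the perturbation
    Judgement("A1", "A1"),   # robust
    Judgement("tie", "A2"),  # flipped by the perturbation
    Judgement("A2", "A2"),   # not eligible: A2 was already preferred
]
print(attack_success_rate(sample))
```

Under this reading, a lower ASR means the judge is more robust to that perturbation.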

Figure 1: Sample demonstration. Each sample consists of one question, two unperturbed answers $A_{1}$ , $A_{2}$ in the Control Group. The perturbed versions of $A_{2}$ are generated for the Experimental Group. Texts with factual errors are colored in red solely for demonstration purposes. Rich conte

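Experimental Results

Research Questions

RQ1: How large are the biases of humans and LLMs when evaluating open-ended generation without a gold standard?
RQ2: What form do biases such as Fallacy Oversight, Authority, Beauty, Verbosity, and Positional take in human and LLM judges, and how strong are they?
RQ3: How susceptible are different judges to perturbations designed to exploit these biases?
RQ4: Can biases in LLM judges be exploited to produce superficially favorable evaluations of weaker or perturbed answers?
RQ5: Which stopgap measures (such as repeated evaluation with randomized answer positions) can mitigate these biases, and how does the open-source dataset support research on robust evaluation?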
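The repeated, position-randomized evaluation mentioned in RQ5 can be sketched as follows. This is an illustrative outline only: `judge` is a hypothetical callable that returns `"first"`, `"second"`, or `"tie"`, and the function name, round count, and seeding are assumptions rather than the paper's protocol.

```python
import random
from collections import Counter


def judge_with_shuffled_positions(judge, question, answer_1, answer_2,
                                  rounds=4, seed=0):
    """Query the judge several times, randomizing which answer is shown first.

    Votes are mapped back to the underlying answers ("A1"/"A2"), so a judge
    that simply favors one slot cannot consistently favor one answer.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(rounds):
        swapped = rng.random() < 0.5
        first, second = (answer_2, answer_1) if swapped else (answer_1, answer_2)
        verdict = judge(question, first, second)
        if verdict == "tie":
            votes["tie"] += 1
        elif (verdict == "first") != swapped:
            # "first" without a swap, or "second" with a swap, both mean A1.
            votes["A1"] += 1
        else:
            votes["A2"] += 1
    return votes


# Toy judge with pure positional bias: it always picks the first slot.
def always_first(question, first, second):
    return "first"


print(judge_with_shuffled_positions(always_first, "Q?", "answer one", "answer two"))
```

With a purely positional judge like `always_first`, the votes are split according to the random swaps instead of accumulating on one answer, which is the signal such a check is meant to expose.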

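Key Findings

Both human and LLM judges exhibit biases in open-ended evaluation.
Human judges show significant Fallacy Oversight, Beauty, and Verbosity biases; among LLMs, the biases vary from model to model.
Different LLMs have different bias profiles, and some are more robust than others to certain perturbations.
Bias-targeted perturbations can superficially raise a judge's rating of perturbed or weaker answers, enabling attacks on LLM judges.
The study provides an open-source dataset for open-ended evaluation to support further bias analysis and the development of robust evaluation systems.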

Figure 3: Verbosity Bias of different judges. The X-Axis indicates the absolute length difference between the long answer and the short answer. Lengths are computed using tiktoken library from OpenAI. The Y-Axis indicates the preference towards the long answer. 0 refers to a total favor for the shor
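A post-hoc verbosity analysis of the kind plotted in Figure 3 could be approximated like this. The sketch buckets answer pairs by absolute length difference and reports how often the longer answer wins. The caption says lengths are computed with tiktoken; to stay dependency-free, the default tokenizer here is a whitespace split (an assumption), and a tiktoken encoder's `encode` could be wrapped and passed as `count_tokens` instead. The function name, record layout, and bucket size are hypothetical.

```python
from collections import defaultdict


def long_answer_preference(records, bucket_size=50,
                           count_tokens=lambda text: len(text.split())):
    """Map each length-difference bucket to the share of wins by the longer answer.

    records: iterable of (answer_a, answer_b, winner), winner in {"a", "b"}.
    In this sketch 0.0 means the shorter answer always won and 1.0 means the
    longer answer always won; pairs of equal length are skipped.
    """
    buckets = defaultdict(list)
    for answer_a, answer_b, winner in records:
        len_a, len_b = count_tokens(answer_a), count_tokens(answer_b)
        if len_a == len_b:
            continue
        longer = "a" if len_a > len_b else "b"
        bucket = (abs(len_a - len_b) // bucket_size) * bucket_size
        buckets[bucket].append(1.0 if winner == longer else 0.0)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


records = [
    ("a b c", "a", "a"),    # longer answer wins
    ("a", "a b c d", "a"),  # shorter answer wins
    ("x y", "x y", "b"),    # equal length, skipped
]
print(long_answer_preference(records, bucket_size=2))
```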


This review was generated by AI and checked by a human editor.