QUICK REVIEW

[Paper Review] Humans or LLMs as the Judge? A Study on Judgement Biases

Guiming Hardy Chen, Shunian Chen|arXiv (Cornell University)|Feb 16, 2024

Law, Economics, and Judicial Systems|Cited by 9

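One-line summary

This paper proposes a framework for studying five judgment biases of humans and large language models (LLMs) when they evaluate open-ended answers, runs large-scale perturbation experiments, and shows that both groups exhibit exploitable biases, supported by an open-source dataset.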

ABSTRACT

Adopting human and large language models (LLM) as judges (a.k.a human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLMs, questioning the reliability of the evaluation results. In this paper, we propose a novel framework that is free from referencing groundtruth annotations for investigating Misinformation Oversight Bias, Gender Bias, Authority Bias and Beauty Bias on LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit these biases to conduct attacks on LLM judges. We hope that our work can notify the community of the bias and vulnerability of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.

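Motivation and objectives

Promote robust evaluation of LLMs by examining the biases of human and LLM judges on open-ended tasks.
Define and categorize five judgment biases (Fallacy Oversight, Authority, Beauty, Verbosity, Positional) and examine their effects.
Develop an intervention and post-hoc analysis framework that does not depend on ground-truth references.
Build and release a public dataset for open-ended evaluation to facilitate bias analysis.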

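Proposed method

Design an intervention and post-hoc analysis framework that evaluates the five biases without requiring ground truth.
Use GPT-4 to generate question-answer pairs aligned with the revised Bloom's Taxonomy, and collect human judgments of semantic quality.
Perturb answers with interventions that introduce factual errors, fake references, or rich content, and measure vulnerability as the Attack Successful Rate (ASR).
Evaluate a set of human judges and representative LLMs (e.g., GPT-4, GPT-4-Turbo, Claude-2, PaLM-2, Ernie, LLaMA2) under a Control Group and an Experimental Group.
Compute ASR and accuracy to quantify robustness to perturbation and identify biases.
Analyze positional and verbosity biases through post-hoc analysis and multiple evaluation rounds with shuffled answer positions.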
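The ASR measurement described above can be illustrated with a minimal sketch. The review does not give the exact formula, so the definition here is an assumption: among samples where the judge did not already prefer $A_{2}$ in the Control Group, ASR is the share where the judge prefers the perturbed $A_{2}$ in the Experimental Group. This fits perturbations meant to make an answer look better (fake references, rich content); the `Vote` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One judge's preference on one sample (hypothetical record layout)."""
    control: str       # "A1", "A2", or "tie" with both answers unperturbed
    experimental: str  # "A1", "A2", or "tie" with A2 replaced by its perturbed version


def attack_success_rate(votes):
    """Share of attackable samples whose preference moved to the perturbed answer.

    Assumption (not taken from the paper): a sample is attackable when the
    judge did not prefer A2 in the control group, and the attack succeeds
    when the judge prefers the perturbed A2 in the experimental group.
    """
    attackable = [v for v in votes if v.control != "A2"]
    if not attackable:
        return 0.0
    flipped = sum(1 for v in attackable if v.experimental == "A2")
    return flipped / len(attackable)


if __name__ == "__main__":
    sample_votes = [
        Vote("A1", "A2"),   # flipped toward the perturbed answer
        Vote("A1", "A1"),   # unaffected
        Vote("A2", "A2"),   # already preferred A2, so not counted
        Vote("tie", "A2"),  # flipped toward the perturbed answer
    ]
    print(f"ASR = {attack_success_rate(sample_votes):.2f}")
```

Excluding samples where $A_{2}$ already won keeps the rate from being inflated by preferences that the perturbation did not cause.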

Figure 1: Sample demonstration. Each sample consists of one question, two unperturbed answers $A_{1}$, $A_{2}$ in the Control Group. The perturbed versions of $A_{2}$ are generated for the Experimental Group. Texts with factual errors are colored in red solely for demonstration purposes. Rich conte

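Experimental results

Research questions

RQ1: To what extent are humans and LLMs biased when evaluating open-ended generation for which no gold standard exists?
RQ2: How do the Fallacy Oversight, Authority, Beauty, Verbosity, and Positional biases manifest in both human and LLM judges, and how large are they?
RQ3: How susceptible are different judges to perturbations designed to exploit these biases?
RQ4: Can the biases of LLM judges be exploited to steer weak or perturbed answers toward superficially favorable ratings?
RQ5: What safeguards mitigate these biases (e.g., multiple evaluations with randomized answer positions), and how can a public dataset support research on robust evaluation?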
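The randomized-position safeguard mentioned in RQ5 can be sketched as follows. This is an illustrative sketch, not the authors' protocol: the `judge` callable, its `"first"`/`"second"` return convention, and the function name are all assumptions. Each round presents the two answers in a random order and maps the pick back to the underlying answer, so a judge that favors a position spreads its votes instead of rewarding one answer.

```python
import random
from collections import Counter


def randomized_position_votes(judge, question, answer_a, answer_b, rounds=10, seed=0):
    """Collect votes over several rounds with randomly ordered answers.

    `judge(question, first, second)` is a hypothetical callable that must
    return "first" or "second". Votes are mapped back to answer identity.
    """
    rng = random.Random(seed)
    votes = Counter({"a": 0, "b": 0})
    for _ in range(rounds):
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        pick = judge(question, first, second)
        if pick not in ("first", "second"):
            raise ValueError(f"unexpected judge output: {pick!r}")
        # answer_a sits in the first slot unless the order was swapped
        picked_a = (pick == "first") != swapped
        votes["a" if picked_a else "b"] += 1
    return votes


if __name__ == "__main__":
    def always_first(question, first, second):
        """Toy judge with a pure positional bias."""
        return "first"

    print(randomized_position_votes(always_first, "Q", "answer A", "answer B", rounds=20))
```

A judge that decides on content alone gives every vote to the same answer regardless of order, while a split vote across rounds signals that position is influencing the outcome.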

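Key findings

Both human and LLM judges exhibit biases in open-ended evaluation.
Human judges show substantial Fallacy Oversight, Beauty, and Verbosity biases, while the biases of LLMs vary by model.
Different LLMs have distinct bias profiles: some are more robust to particular perturbations than others.
Bias-exploiting perturbations can superficially improve how perturbed or weak answers are judged, which enables attacks on LLM judges.
The study provides an open-source dataset for open-ended evaluation that supports further bias analysis and the development of robust evaluation systems.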

Figure 3: Verbosity Bias of different judges. The X-Axis indicates the absolute length difference between the long answer and the short answer. Lengths are computed using tiktoken library from OpenAI. The Y-Axis indicates the preference towards the long answer. 0 refers to a total favor for the shor
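The kind of analysis plotted in Figure 3 can be approximated with a short sketch: group answer pairs by absolute length difference and compute how often the longer answer wins. The caption says lengths are computed with tiktoken; to stay self-contained, this sketch defaults to a whitespace token count and accepts any counting function through the hypothetical `count_tokens` parameter. The record layout, the bin width, and the 0-to-1 scale (share of votes for the longer answer) are assumptions of this sketch.

```python
from collections import defaultdict


def whitespace_token_count(text):
    """Stand-in tokenizer; a tiktoken-based counter could be passed instead."""
    return len(text.split())


def long_answer_preference(records, bin_width=50, count_tokens=whitespace_token_count):
    """Share of votes for the longer answer, grouped by length difference.

    `records` is an iterable of (answer_1, answer_2, winner) tuples with
    winner in {1, 2} (hypothetical layout). Returns {bin_start: share}.
    """
    totals = defaultdict(int)
    long_wins = defaultdict(int)
    for answer_1, answer_2, winner in records:
        n1, n2 = count_tokens(answer_1), count_tokens(answer_2)
        if n1 == n2:
            continue  # equal lengths: there is no longer answer to prefer
        longer = 1 if n1 > n2 else 2
        bucket = (abs(n1 - n2) // bin_width) * bin_width
        totals[bucket] += 1
        long_wins[bucket] += int(winner == longer)
    return {bucket: long_wins[bucket] / totals[bucket] for bucket in sorted(totals)}


if __name__ == "__main__":
    toy_records = [
        ("a b c", "a", 1),    # longer answer wins
        ("a", "a b c", 1),    # shorter answer wins
        ("a b", "a b", 2),    # skipped: equal length
    ]
    print(long_answer_preference(toy_records, bin_width=2))
```

A share that rises with the length difference would be consistent with a verbosity bias, although answer length and answer quality can be correlated, so the share alone does not prove it.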


This review was generated by AI and checked by a human editor.