QUICK REVIEW

[Paper Review] Evaluating Large Language Models at Evaluating Instruction Following

Zhiyuan Zeng, Jiatong Yu|arXiv (Cornell University)|Oct 11, 2023

Natural Language Processing Techniques|Cited by 11

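One-Sentence Summary

The paper introduces LLMBar, a meta-evaluation benchmark that tests how well LLM evaluators can identify instruction-following outputs, and presents prompting strategies that narrow the gap between LLM evaluators and humans.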

ABSTRACT

As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasing list of models. This paper investigates the efficacy of these "LLM evaluators", particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. The authors manually curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBar, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models.

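Motivation and Objectives

Assess how different LLM evaluators perform when judging instruction-following outputs on a carefully curated benchmark.
Provide an objective, reproducible benchmark that prioritizes instruction following over superficial quality.
Explore prompting strategies that bring LLM evaluators into closer agreement with human judgments of instruction following.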

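Proposed Method

Define each LLMBar instance as a 4-tuple (I, O1, O2, p): an instruction I, two outputs O1 and O2, and a gold preference label p. The 419 instances are split into a Natural set and an Adversarial set.
Assemble the Natural set from AlpacaFarm and LLMEval², applying filters to ensure each instance has an objective instruction-following preference.
Build the Adversarial set with four strategies (Neighbor, GPTInst, GPTOut, Manual), challenging evaluators with outputs that are superficially appealing but violate the instruction.
Evaluate base LLMs (GPT-4, ChatGPT, LLaMA-2-Chat, PaLM2, Falcon) as evaluators under a range of prompting strategies.
Introduce new prompting strategies (Rules, Metrics, Swap) and combinations (e.g., Metrics+Reference) to improve evaluator accuracy and reduce bias.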
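
The 4-tuple described above can be sketched as a small data structure. This is a minimal illustration only: the class name, field names, the `swapped` helper, and the example instance are all hypothetical and are not taken from the LLMBar release.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class LLMBarInstance:
    """One benchmark item (I, O1, O2, p). Field names are assumptions."""
    instruction: str          # I: the instruction given to the model
    output_1: str             # O1: first candidate output
    output_2: str             # O2: second candidate output
    preferred: Literal[1, 2]  # p: which output is the gold preference
    subset: str               # e.g. "Natural", "Neighbor", "GPTInst", "GPTOut", "Manual"


def swapped(inst: LLMBarInstance) -> LLMBarInstance:
    """Return the same item with the two outputs presented in reverse order."""
    return LLMBarInstance(
        instruction=inst.instruction,
        output_1=inst.output_2,
        output_2=inst.output_1,
        preferred=3 - inst.preferred,  # 1 becomes 2, 2 becomes 1
        subset=inst.subset,
    )


# Made-up example (not from the benchmark): the second output sounds
# engaging but does not do what the instruction asks.
example = LLMBarInstance(
    instruction="List three prime numbers.",
    output_1="2, 3, 5",
    output_2="Prime numbers are fascinating! They have been studied for centuries.",
    preferred=1,
    subset="Manual",
)
```

Keeping the gold label tied to the output rather than to its position makes it straightforward to present the same pair in both orders when checking an evaluator for positional bias.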

Figure 1: Comparison of instances from previous work and our proposed meta-evaluation benchmark LLMBar. LLMBar curates output pairs that have objective preferences. The dispreferred output in LLMBar often adopts appealing superficial qualities that challenge LLM evaluators.

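Experimental Results

Research Questions

RQ1: How do different base LLMs and prompting strategies affect evaluator performance on LLMBar?
RQ2: How does LLMBar reflect instruction-following evaluation compared with existing meta-evaluation benchmarks?
RQ3: Can the new prompting strategies close the gap between LLM evaluators and expert human judgment?
RQ4: How do the Adversarial and Natural subsets affect evaluator performance?
RQ5: Do reward models based on LLM or human preferences agree with LLMBar's judgments?

Key Findings

LLM evaluators fall short of human performance on LLMBar, most notably on the Adversarial set.
Prompting strategies such as Rules+Metrics+Reference consistently improve evaluator accuracy across models.
CoT-based prompts can degrade performance by biasing the evaluator toward superficially better outputs.
Prompting strategies combined with Swap improve positional agreement without sacrificing accuracy.
GPT-4-based evaluators outperform the others on the Adversarial set but still fall short of expert human agreement (e.g., 95%).
Reward models trained on human or LLM annotations perform poorly on LLMBar relative to the human agreement level.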
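
The two quantities these findings refer to, accuracy and positional agreement, can be computed roughly as follows. This is a sketch under stated assumptions: each pair is judged twice with the output order swapped, accuracy is averaged over both orderings, and agreement is the fraction of pairs where both orderings give the same verdict. The `Judge` signature and the toy judges are hypothetical stand-ins for an LLM evaluator, not the paper's code.

```python
from typing import Callable, Sequence, Tuple

# (instruction, output_1, output_2, gold_label), with gold_label in {1, 2}
Instance = Tuple[str, str, str, int]
# A judge returns 1 or 2: the position of the output it prefers.
Judge = Callable[[str, str, str], int]


def evaluate(judge: Judge, instances: Sequence[Instance]) -> Tuple[float, float]:
    """Return (accuracy averaged over both orderings, positional agreement rate)."""
    if not instances:
        raise ValueError("need at least one instance")
    correct = 0
    agree = 0
    for instruction, o1, o2, gold in instances:
        first = judge(instruction, o1, o2)
        # Second query with the outputs swapped; map the verdict back
        # to the original numbering (1 <-> 2).
        second = 3 - judge(instruction, o2, o1)
        correct += int(first == gold) + int(second == gold)
        agree += int(first == second)
    n = len(instances)
    return correct / (2 * n), agree / n


def always_first(instruction: str, a: str, b: str) -> int:
    """Toy judge with pure position bias: always prefers the first slot."""
    return 1


def longer_wins(instruction: str, a: str, b: str) -> int:
    """Toy judge with a superficial bias: prefers the longer output."""
    return 1 if len(a) >= len(b) else 2


toy_data = [
    ("List three primes.", "2, 3, 5", "Primes are fascinating and have a long history.", 1),
    ("Greet the user warmly.", "Hello there, it is lovely to meet you!", "No.", 1),
]

position_biased = evaluate(always_first, toy_data)
length_biased = evaluate(longer_wins, toy_data)
```

On this toy data, the position-biased judge gets 50% accuracy with zero agreement, while the length-biased judge is fully consistent across orderings yet still wrong on the pair where the longer output ignores the instruction. That is why the two numbers are worth reporting separately.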

Figure 2: Illustration of the Adversarial set collection process (except the Manual subset). Given an instruction $I$ and a preferred output $O_{1}$, we either collect a closely related but different enough instruction $I^{\prime}$ and generate dispreferred (adversarial) output $O_{2}$ (in Neighbor

This review was generated by AI and checked by a human editor.