QUICK REVIEW

[Paper Review] Evaluating Large Language Models at Evaluating Instruction Following

Zhiyuan Zeng, Jiatong Yu | arXiv (Cornell University) | October 11, 2023

Natural Language Processing Techniques | Citations: 11

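One-Line Summary

This paper introduces LLMBar, a meta-evaluation benchmark that tests how well LLM evaluators can identify instruction-following outputs, and proposes prompting strategies that narrow the gap between LLM and human evaluators.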

ABSTRACT

As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasing list of models. This paper investigates the efficacy of these "LLM evaluators", particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. The authors manually curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBar, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models.

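Motivation and Objectives

Evaluate how various LLM evaluators perform at judging instruction-following outputs, using a rigorously curated benchmark.
Provide an objective, reproducible benchmark that emphasizes instruction following over superficial quality.
Explore prompting strategies that can improve agreement between LLM evaluators and human judgments of instruction following.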

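Proposed Method

Each LLMBar instance is defined as a tuple (I, O1, O2, p): an instruction I, two outputs O1 and O2, and a preference label p indicating which output is objectively better. The benchmark contains 419 instances split into a Natural set and an Adversarial set.
The Natural set is drawn from AlpacaFarm and LLMEval², with filtering applied to ensure that each instance has an objective preference with respect to instruction following.
The Adversarial set is built with four strategies (Neighbor, GPTInst, GPTOut, Manual) and challenges evaluators with outputs that look appealing on the surface but violate the instruction.
Base LLMs (GPT-4, ChatGPT, LLaMA-2-Chat, PaLM2, Falcon) are used as evaluators under a range of prompting strategies.
New prompting strategies (Rules, Metrics, Swap) and their combinations (e.g., Metrics+Reference) are introduced to raise evaluator accuracy and reduce bias.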
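The (I, O1, O2, p) tuple and the idea of querying an evaluator with the two outputs in both orders can be sketched as below. This is only an illustration under stated assumptions: the field names, the `Judge` call signature, and the toy `longer_is_better` judge are hypothetical and not taken from the paper. A real evaluator would prompt an LLM, and this review does not describe how the Swap strategy reconciles the two verdicts, so the sketch merely collects them.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Instance:
    """One benchmark instance (I, O1, O2, p); field names are assumptions."""
    instruction: str  # I
    output_1: str     # O1
    output_2: str     # O2
    label: int        # p: 1 if O1 is the preferred output, 2 if O2 is


# Hypothetical judge interface: (instruction, first_output, second_output)
# -> 1 if the first shown output is preferred, 2 if the second is.
Judge = Callable[[str, str, str], int]


def judge_both_orders(judge: Judge, inst: Instance) -> Tuple[int, int]:
    """Query the judge with (O1, O2) and with (O2, O1).

    Both verdicts are returned in the original numbering, so a verdict of 1
    always means "O1 preferred" regardless of presentation order.
    """
    original = judge(inst.instruction, inst.output_1, inst.output_2)
    swapped_raw = judge(inst.instruction, inst.output_2, inst.output_1)
    swapped = 3 - swapped_raw  # position 1 <-> 2 maps back to O2 <-> O1
    return original, swapped


def longer_is_better(instruction: str, first: str, second: str) -> int:
    """Toy stand-in for an LLM judge that is swayed by superficial length."""
    return 1 if len(first) >= len(second) else 2


example = Instance(
    instruction="Reply with the single word 'yes'.",
    output_1="yes",
    output_2="Yes, absolutely! Happy to help.",
    label=1,
)
print(judge_both_orders(longer_is_better, example))  # prints (2, 2)
```

In this toy case the judge picks the longer, friendlier output in both orders, so it is positionally consistent but wrong with respect to the label, which is the kind of failure the Adversarial set is meant to expose.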

Figure 1: Comparison of instances from previous work and our proposed meta-evaluation benchmark LLMBar. LLMBar curates output pairs that have objective preferences. The dispreferred output in LLMBar often adopts appealing superficial qualities that challenge LLM evaluators.

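Experimental Results

Research Questions

RQ1: How do different base LLMs and prompting strategies affect evaluator performance on LLMBar?
RQ2: How does LLMBar differ from existing meta-evaluation benchmarks for instruction-following evaluation?
RQ3: Can the new prompting strategies narrow the gap between LLM evaluators and expert human judgments of instruction following?
RQ4: How do the Adversarial and Natural subsets affect evaluator performance?
RQ5: Do reward models trained on LLM-based or human preferences agree with the LLMBar preference labels?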

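Key Findings

LLM evaluators underperform humans on LLMBar, especially on the Adversarial set.
Prompting strategies such as Rules+Metrics+Reference consistently improve evaluator accuracy across models.
CoT-based prompts can degrade performance by biasing evaluators toward superficially better outputs.
Combining prompting strategies with Swap improves positional agreement without hurting accuracy.
GPT-4-based evaluators outperform other models on the Adversarial set but still fall short of expert human agreement (e.g., 95%).
Reward models trained on human or LLM annotations score below the human agreement level on LLMBar.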
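The two quantities these findings refer to, accuracy and positional agreement, can be computed from per-instance verdicts as in the sketch below. It assumes each instance was judged twice (once per output order), that both verdicts are expressed in the original O1/O2 numbering, and that accuracy is averaged over both orders. Those conventions and the function names are assumptions for illustration, not details confirmed by this review.

```python
from typing import List, Tuple


def accuracy(verdicts: List[Tuple[int, int]], labels: List[int]) -> float:
    """Fraction of verdicts matching the gold label, averaged over both orders."""
    if not labels or len(verdicts) != len(labels):
        raise ValueError("verdicts and labels must be non-empty and equal length")
    correct = sum((a == p) + (b == p) for (a, b), p in zip(verdicts, labels))
    return correct / (2 * len(labels))


def positional_agreement(verdicts: List[Tuple[int, int]]) -> float:
    """Fraction of instances where the verdict is the same in both orders."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return sum(a == b for a, b in verdicts) / len(verdicts)


# Made-up verdicts for four instances: (original order, swapped order).
verdicts = [(1, 1), (2, 1), (2, 2), (1, 2)]
labels = [1, 1, 2, 2]

print(accuracy(verdicts, labels))      # prints 0.75
print(positional_agreement(verdicts))  # prints 0.5
```

Reporting the two numbers separately matters because an evaluator can be consistent across orders while still preferring the wrong output, or accurate on average while flipping its verdict whenever the order changes.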

Figure 2: Illustration of the Adversarial set collection process (except the Manual subset). Given an instruction $I$ and a preferred output $O_{1}$, we either collect a closely related but different enough instruction $I^{\prime}$ and generate dispreferred (adversarial) output $O_{2}$ (in Neighbor


This review was generated by AI and reviewed by a human editor.