[Paper Digest] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang|arXiv (Cornell University)|Jun 9, 2023
Topic Modeling | Cited by 432
ONE-SENTENCE SUMMARY

The paper evaluates whether strong LLMs can serve as judges of chat assistants by comparing LLM judgments with human preferences on MT-bench and Chatbot Arena. It finds that GPT-4 agrees with human preferences more than 80% of the time, the same level of agreement observed between humans.

ABSTRACT

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.

MOTIVATION AND OBJECTIVES

  • Motivate the need for evaluating LLM-based chatbots beyond traditional capability benchmarks.
  • Propose LLMs as judges to estimate human preferences for open-ended, multi-turn interactions.
  • Create two benchmarks (MT-bench and Chatbot Arena) to measure human-aligned evaluation.
  • Analyze biases and limitations of LLM judges and propose mitigation strategies.
  • Release datasets and promote a hybrid evaluation framework combining capability and preference benchmarks.

PROPOSED METHOD

  • Introduce three LLM-as-a-judge variations: pairwise comparison, single answer grading, and reference-guided grading.
  • Investigate biases including position bias, verbosity bias, and self-enhancement bias, and evaluate mitigation techniques.
  • Use MT-bench (80 multi-turn questions, 3K expert votes) and Chatbot Arena (30K crowd votes) to compare LLM judges with human preferences.
  • Evaluate agreement between GPT-4 judges and humans on MT-bench and Arena datasets under multiple setups.
  • Explore enhancements such as swapping positions, few-shot judges, chain-of-thought prompts, reference-guided judgments, and fine-tuning judges.
  • Provide data release of MT-bench questions, expert votes, and arena conversations.
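The position-swapping idea listed above can be sketched as follows: query the judge twice with the answer order reversed, and count a win only when the same answer is preferred in both orders. This is a minimal illustration, not the authors' implementation. The `Judge` callable, its `"first"`/`"second"`/`"tie"` return values, and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch): it receives
# the question and two answers in display order and returns "first",
# "second", or "tie". A real judge would wrap an LLM API call.
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after querying the judge in both orders."""
    first_pass = judge(question, answer_a, answer_b)   # A is shown first
    second_pass = judge(question, answer_b, answer_a)  # B is shown first

    # Conservative rule: a win counts only if it survives the swap.
    if first_pass == "first" and second_pass == "second":
        return "A"
    if first_pass == "second" and second_pass == "first":
        return "B"
    return "tie"


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whichever answer is shown first."""
    return "first"


def length_judge(question: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    print(judge_with_swap(position_biased_judge, "q", "x", "y"))
    print(judge_with_swap(length_judge, "q", "a longer answer", "short"))
```

With this rule, a judge whose verdict depends only on position is reduced to a tie, while a judge with a position-independent preference keeps its verdict. Note that the toy `length_judge` is itself an example of verbosity bias, which swapping does not address.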

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: Can strong LLMs (e.g., GPT-4) replicate human preferences for open-ended, multi-turn chatbot interactions?
  • RQ2: What biases affect LLM-based judgments (position, verbosity, self-enhancement), and how can they be mitigated?
  • RQ3: Do LLM judges agree with human evaluators across controlled (MT-bench) and crowdsourced (Chatbot Arena) settings?
  • RQ4: What is the added value of reference guidance, chain-of-thought, or few-shot prompting for judge reliability?
  • RQ5: How do model variants and training data influence evaluation outcomes when using LLM-as-a-judge?

KEY FINDINGS

  • GPT-4 as judge achieves over 80% agreement with human preferences on MT-bench, matching human-human agreement levels.
  • GPT-4 single-answer grading aligns well with pairwise judgments and humans, offering scalability.
  • Position and verbosity biases exist but can be mitigated; some biases are model-dependent (e.g., name bias in Claude-v1).
  • Reference-guided and chain-of-thought prompting substantially reduce math/reasoning grading failures for judges.
  • MT-bench and Chatbot Arena complement standard benchmarks; GPT-4 judge performance tracks human preferences across model pairs and categories.
  • Fine-tuning on high-quality dialog data can improve MMLU, TruthfulQA, and MT-bench outcomes, but no single benchmark fully determines model quality.
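The agreement figures above compare judge verdicts with human verdicts on the same comparisons. As a rough sketch, agreement can be computed as the fraction of comparisons on which the two verdicts match. The function name, the `"A"`/`"B"`/`"tie"` vote encoding, the `include_ties` flag, and the sample votes are all assumptions made for illustration. They are not data or code from the paper.

```python
from typing import Sequence


def agreement_rate(
    judge_votes: Sequence[str],
    human_votes: Sequence[str],
    include_ties: bool = True,
) -> float:
    """Fraction of comparisons where judge and human give the same verdict.

    Votes are encoded as "A", "B", or "tie" (an assumed encoding). With
    include_ties=False, any comparison where either side voted "tie" is
    dropped before computing the rate.
    """
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")

    pairs = list(zip(judge_votes, human_votes))
    if not include_ties:
        pairs = [(j, h) for j, h in pairs if j != "tie" and h != "tie"]
    if not pairs:
        raise ValueError("no comparisons left to score")

    matches = sum(1 for j, h in pairs if j == h)
    return matches / len(pairs)


if __name__ == "__main__":
    # Illustrative toy votes only, not results from the paper.
    judge = ["A", "B", "tie", "A"]
    human = ["A", "B", "A", "B"]
    print(agreement_rate(judge, human))
    print(agreement_rate(judge, human, include_ties=False))
```

Whether ties are counted changes the number noticeably, which is one reason agreement is worth reporting under more than one setup.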

This digest was generated by AI and reviewed by human editors.