QUICK REVIEW

[Paper Review] From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

Dawei Li, Bocheng Jiang|arXiv (Cornell University)|Nov 25, 2024

Legal Education and Practice Innovations|Citations: 10

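One-sentence summary

This paper surveys the LLM-as-a-judge paradigm: it defines the input and output formats, presents a three-dimensional taxonomy of what, how, and where to judge, organizes the evaluation benchmarks, and outlines the main challenges and future directions.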

ABSTRACT

Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Traditional methods, usually matching-based or small model-based, often fall short in open-ended and dynamic scenarios. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection for various machine learning evaluation scenarios. This paper presents a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to review this evolving field. We first provide the definition from both input and output perspectives. Then we introduce a systematic taxonomy to explore LLM-as-a-judge along three dimensions: what to judge, how to judge, and how to benchmark. Finally, we also highlight key challenges and promising future directions for this emerging area. More resources on LLM-as-a-judge are on the website: https://llm-as-a-judge.github.io and https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge.

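Motivation and objectives

Formalize the definition of LLM-as-a-judge from the input and output perspectives.
Propose a comprehensive taxonomy of judging attributes, methodologies, and application areas.
Collect and summarize benchmarks for evaluating LLM-based judgment.
Identify challenges and point out promising directions for future research.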

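Proposed method

Defines the input formats (point-wise, pair/list-wise) and the output formats (score, ranking, selection).
Develops a three-dimensional taxonomy: what to judge (attributes), how to judge (tuning and prompting), and where to judge (application areas).
Surveys attributes such as helpfulness, harmlessness, reliability, relevance, feasibility, and overall quality.
Summarizes tuning techniques (data sources, supervised fine-tuning, preference learning) and prompting strategies (swapping, rule augmentation, multi-agent collaboration).
Lists and categorizes existing benchmarks for evaluating LLM-based judgment across tasks.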
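
The input and output formats named above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: `Scorer` and `toy_scorer` are hypothetical stand-ins for a real LLM call, and the function names are assumptions chosen for this example. The sketch shows how a point-wise input maps to a score, a pair-wise input to a selection, and a list-wise input to a ranking.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an LLM call: maps one candidate text to a score.
Scorer = Callable[[str], float]


def toy_scorer(candidate: str) -> float:
    """Placeholder judge: counts distinct words (NOT a real quality metric)."""
    return float(len(set(candidate.lower().split())))


def judge_pointwise(candidate: str, scorer: Scorer) -> float:
    """Point-wise input -> score output."""
    return scorer(candidate)


def judge_pairwise(first: str, second: str, scorer: Scorer) -> str:
    """Pair-wise input -> selection output ("first", "second", or "tie")."""
    a, b = scorer(first), scorer(second)
    if a == b:
        return "tie"
    return "first" if a > b else "second"


def judge_listwise(candidates: Sequence[str], scorer: Scorer) -> List[int]:
    """List-wise input -> ranking output (candidate indices, best first)."""
    return sorted(
        range(len(candidates)),
        key=lambda i: scorer(candidates[i]),
        reverse=True,
    )


answers = ["Paris", "The capital of France is Paris", "Paris is the capital"]
score = judge_pointwise(answers[0], toy_scorer)
choice = judge_pairwise(answers[0], answers[1], toy_scorer)
ranking = judge_listwise(answers, toy_scorer)
```

In a real setting the scorer would be replaced by a prompted or fine-tuned judge model; the three wrappers only fix the shape of what goes in and what comes out.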

Figure 1: Overview of various input and output formats of LLM-as-a-judge.

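Experimental results

Research questions

RQ1: Which attributes can LLMs judge effectively, and how are these attributes defined and measured?
RQ2: Which tuning and prompting methodologies enable robust LLM-based judgment across tasks?
RQ3: In which applications are LLM-as-a-judge approaches currently used, and how are they benchmarked?
RQ4: What are the main challenges and future directions for LLM-as-a-judge research?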

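Key findings

The paper presents a detailed taxonomy covering what, how, and where to judge with LLMs.
It organizes a broad range of attributes, including helpfulness, harmlessness, reliability, relevance, feasibility, and overall quality.
It reviews tuning methods (SFT, preference learning, synthetic data) and prompting techniques (swapping, rule augmentation, multi-agent setups).
It summarizes benchmarks and maps uses of LLM-as-a-judge to applications in evaluation, alignment, retrieval, and reasoning.
It discusses challenges such as bias, vulnerability, the dynamic nature of judgment, and human-LLM co-judgment.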
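
As an illustration of the swapping strategy mentioned among the prompting techniques, the following is a minimal sketch, assuming a pairwise judge that returns "first" or "second". The judge is called twice with the candidate order swapped, and the verdict is kept only when both calls agree; otherwise the result is a tie. `PairJudge`, `judge_with_swap`, `always_first`, and `longer_wins` are hypothetical names for this example, and the two toy judges stand in for real LLM calls.

```python
from typing import Callable

# Hypothetical pairwise judge: (question, first, second) -> "first" or "second".
PairJudge = Callable[[str, str, str], str]


def judge_with_swap(question: str, answer_a: str, answer_b: str,
                    judge: PairJudge) -> str:
    """Query the judge in both orders; return "A", "B", or "tie"."""
    original = judge(question, answer_a, answer_b)
    swapped = judge(question, answer_b, answer_a)
    # Map each positional verdict back to the fixed labels A and B.
    verdict_original = {"first": "A", "second": "B"}[original]
    verdict_swapped = {"first": "B", "second": "A"}[swapped]
    # Disagreement between the two orders signals position bias.
    return verdict_original if verdict_original == verdict_swapped else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers the first slot."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge that is order-independent whenever lengths differ."""
    return "first" if len(first) >= len(second) else "second"


biased = judge_with_swap("q", "short", "a longer answer", always_first)
consistent = judge_with_swap("q", "short", "a longer answer", longer_wins)
```

With the position-biased toy judge the two calls contradict each other, so the result is a tie; with the length-based judge both orders agree on the same answer. Treating disagreement as a tie is one design choice among several; averaging scores across both orders would be another.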

Figure 3: LLMs are capable of judging various attributes.

This review was generated by AI and checked by a human editor.