QUICK REVIEW

[Paper Review] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Seungone Kim, Jamin Shin|arXiv (Cornell University)|Oct 12, 2023

Topic Modeling | Citations: 15

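One-Line Summary

Prometheus is an open-source 13B evaluator LLM trained on the new Feedback Collection dataset. It performs fine-grained, rubric-driven evaluation of long-form text, reaches a correlation with human judgments close to GPT-4's level, and outperforms open-source baselines.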

ABSTRACT

Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as an universal reward model. We open-source our code, dataset, and model at https://kaistai.github.io/prometheus/.

Research Motivation and Objectives

Motivate open, transparent, cost-effective evaluation by training an evaluator LLM that can handle thousands of customized score rubrics.
Create a dataset (Feedback Collection) with fine-grained rubrics, reference answers, and feedback to train an evaluator.
Demonstrate that including reference materials and fine-tuning on feedback improves fine-grained evaluation capability.
Evaluate Prometheus against human judgments, GPT-4 evaluations, and ranking-based reward-model benchmarks to establish universality as a reward model.

Proposed Method

Construct the Feedback Collection with 1K fine-grained rubrics, 20K instructions, and 100K responses/feedback generated by GPT-4.
Fine-tune Llama-2-Chat 13B (and 7B) on the Feedback Collection to create Prometheus.
Evaluate using Absolute Grading and Ranking Grading across multiple benchmarks and baselines.
Append reference materials (reference answers, scoring rubrics) and a CoT-like feedback step to induce evaluation capability.
Compare correlations with human evaluators and GPT-4, and assess feedback quality via human pairwise judgments.
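The absolute-grading setup described above (reference answer and score rubric appended to the input, feedback written before the score) can be sketched as follows. This is a minimal illustration, not the authors' actual prompt template: the field names, the prompt wording, the 1-5 score range, and the `[RESULT]` marker are all assumptions made for the example, and the call to the evaluator model itself is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GradingInput:
    """One item to grade. All field names are hypothetical."""
    instruction: str
    response: str
    reference_answer: str
    rubric: str  # criterion text plus a description for each score level


def build_prompt(item: GradingInput) -> str:
    """Assemble a single absolute-grading prompt (assumed layout)."""
    return (
        "Evaluate the response strictly against the score rubric.\n"
        "Write feedback first, then finish with '[RESULT] <score 1-5>'.\n\n"
        f"Instruction:\n{item.instruction}\n\n"
        f"Response to evaluate:\n{item.response}\n\n"
        f"Reference answer:\n{item.reference_answer}\n\n"
        f"Score rubric:\n{item.rubric}\n\n"
        "Feedback:"
    )


def parse_output(text: str) -> Tuple[str, Optional[int]]:
    """Split evaluator output into (feedback, score).

    The score is None when no '[RESULT] n' marker (assumed format) is found.
    """
    match = re.search(r"\[RESULT\]\s*([1-5])\b", text)
    if match is None:
        return text.strip(), None
    return text[: match.start()].strip(), int(match.group(1))
```

Asking for the feedback before the score mirrors the CoT-like step in the list above: the evaluator has to justify its judgment against the rubric before committing to a number.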

Figure 1: Compared to conventional, coarse-grained LLM evaluation, we propose a fine-grained approach that takes user-defined score rubrics as input.

Experimental Results

Research Questions

RQ1: Can an open-source evaluator LLM achieve high correlation with human evaluation on customized rubrics?
RQ2: Does incorporating reference materials and fine-grained rubrics improve evaluation capability beyond single-dimensional baselines?
RQ3: How does Prometheus compare to GPT-4, GPT-3.5-Turbo, and other open-source baselines across multiple evaluation benchmarks?
RQ4: Can Prometheus function effectively as a reward model in ranking-based human preference datasets?

Key Findings

Prometheus-13B attains a Pearson correlation of 0.897 with human evaluators on 45 customized rubrics, on par with GPT-4 (0.882).
Prometheus greatly outperforms GPT-3.5-Turbo (ChatGPT; Pearson correlation of 0.392 with human evaluators) and matches or surpasses several open baselines across multiple benchmarks.
In a pairwise feedback evaluation, Prometheus is preferred over GPT-4 in 58.67% of cases and over GPT-3.5-Turbo in 79.57% of cases by human judges.
Across 1222 rubrics on four benchmarks, Prometheus shows higher correlation with GPT-4 than several baselines, illustrating strong alignment with GPT-4-style evaluation.
Prometheus achieves the highest accuracy among open-source reward models on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment).
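The two kinds of numbers reported above can be computed as in the sketch below: a Pearson correlation between evaluator scores and reference scores (human or GPT-4), and an accuracy on pairwise preference data. This is an illustrative sketch rather than the paper's evaluation code. In particular, the way absolute scores are turned into a ranking decision (grade each response separately, count a pair as correct only when the human-preferred response gets the strictly higher score) is an assumption made here, and both function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def pairwise_accuracy(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred response scores higher.

    Each pair is (score of preferred response, score of rejected response).
    Ties count as incorrect, which is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("no pairs given")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)
```

For example, `pearson(evaluator_scores, human_scores)` would be called once per evaluator over the same set of graded responses, and `pairwise_accuracy` once per preference benchmark.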

Figure 2: The individual components of the Feedback Collection. By adding the appropriate reference materials (Score Rubric and Reference Answer) and training on GPT-4’s feedback, we show that we could obtain a strong open-source evaluator LM.


This review was generated by AI and checked by a human editor.