QUICK REVIEW

[Paper Review] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Seungone Kim, Jamin Shin | arXiv (Cornell University) | Oct 12, 2023
Topic Modeling | Cited by 15
One-Sentence Summary

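Prometheus is an open-source 13B evaluator LLM fine-tuned on the new Feedback Collection dataset to perform fine-grained, rubric-based evaluation of long-form text. Its correlation with human judgment is on par with GPT-4, and it outperforms open-source baselines.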

ABSTRACT

Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as a universal reward model. We open-source our code, dataset, and model at https://kaistai.github.io/prometheus/.
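The dataset sizes quoted in the abstract imply simple per-rubric and per-instruction ratios. The sketch below works them out and shows one way a single training instance might be represented. The class name, field names, and the 1 to 5 score range are assumptions made for illustration; the abstract only states the overall counts and the kinds of material involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackInstance:
    """Hypothetical record for one Feedback Collection example."""
    instruction: str
    response: str
    score_rubric: str
    reference_answer: str
    feedback: str
    score: int  # assumed to be an integer from 1 to 5


# Counts stated in the abstract.
NUM_RUBRICS = 1_000
NUM_INSTRUCTIONS = 20_000
NUM_RESPONSES = 100_000

# Ratios implied by those counts.
instructions_per_rubric = NUM_INSTRUCTIONS // NUM_RUBRICS
responses_per_instruction = NUM_RESPONSES // NUM_INSTRUCTIONS

print(instructions_per_rubric, responses_per_instruction)
```

Five responses per instruction would be consistent with one response per score level on a five-point scale, although that reading is an inference rather than something stated here.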

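Motivation and Objectives

  • Advance open, transparent, and cost-effective evaluation by training an evaluator LLM that can handle thousands of customized score rubrics.
  • Create a dataset (the Feedback Collection) containing fine-grained score rubrics, reference answers, and feedback for training evaluators.
  • Show that including reference materials and fine-tuning on feedback improves fine-grained evaluation capability.
  • Evaluate Prometheus against human evaluation, GPT-4 evaluation, and ranking-based reward model benchmarks to establish its generality as a reward model.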

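Proposed Method

  • Construct the Feedback Collection, comprising 1K fine-grained score rubrics, 20K instructions, and 100K responses with feedback generated by GPT-4.
  • Fine-tune Llama-2-Chat 13B (and 7B) on the Feedback Collection to create Prometheus.
  • Evaluate with Absolute Grading and Ranking Grading across multiple benchmarks and baselines.
  • Attach reference materials (reference answer, score rubric) and a CoT-like feedback step to elicit evaluation capability.
  • Compare correlation with human evaluators and with GPT-4, and assess feedback quality through human pairwise judgments.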
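To make the absolute grading setup more concrete, here is a minimal sketch of how an evaluation prompt could bundle the instruction, the response, and the two reference materials, and how a feedback-then-score output could be parsed. The prompt layout, the `[RESULT]` marker, the 1 to 5 range, and the function names `build_prompt` and `parse_output` are all assumptions for illustration, not the authors' released template.

```python
import re
from typing import Optional, Tuple

# Hypothetical prompt layout: section names and the "[RESULT]" marker are
# illustrative assumptions, not taken from the paper summary.
PROMPT_TEMPLATE = (
    "Instruction to evaluate:\n{instruction}\n\n"
    "Response to evaluate:\n{response}\n\n"
    "Reference answer:\n{reference_answer}\n\n"
    "Score rubric:\n{rubric}\n\n"
    "Write feedback first, then end with '[RESULT] <integer from 1 to 5>'.\n"
)

# Matches a single digit 1-5 after the marker; "45" or "7" will not match.
_RESULT_RE = re.compile(r"\[RESULT\]\s*([1-5])\b")


def build_prompt(instruction: str, response: str,
                 reference_answer: str, rubric: str) -> str:
    """Fill the template with the response and its reference materials."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        rubric=rubric,
    )


def parse_output(text: str) -> Tuple[str, Optional[int]]:
    """Split evaluator output into (feedback, score); score is None if absent."""
    match = _RESULT_RE.search(text)
    if match is None:
        return text.strip(), None
    return text[:match.start()].strip(), int(match.group(1))
```

Asking for the feedback before the score mirrors the CoT-like step listed above: the rationale is generated first and the score is conditioned on it.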
Figure 1: Compared to conventional, coarse-grained LLM evaluation, we propose a fine-grained approach that takes user-defined score rubrics as input.

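Experimental Results

Research Questions

  • RQ1: Can an open-source evaluator LLM achieve high correlation with human evaluation on customized score rubrics?
  • RQ2: Does combining reference materials with fine-grained score rubrics improve evaluation capability beyond single-dimension baselines?
  • RQ3: How does Prometheus compare with GPT-4, GPT-3.5-Turbo, and other open-source baselines across multiple evaluation benchmarks?
  • RQ4: Can Prometheus serve effectively as a reward model on ranking-based human preference datasets?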

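Key Findings

  • Prometheus-13B reaches a Pearson correlation of 0.897 with human evaluators on 45 customized score rubrics, on par with GPT-4 (0.882).
  • Prometheus outperforms GPT-3.5-Turbo (0.392) and matches or exceeds several open-source baselines on multiple benchmarks.
  • In pairwise feedback evaluation, human judges preferred Prometheus's feedback over GPT-4's in 58.67% of cases and over GPT-3.5-Turbo's in 79.57% of cases.
  • Across 1222 score rubrics on four benchmarks, Prometheus correlates with GPT-4 more strongly than several baselines, indicating strong agreement with GPT-4-style evaluation.
  • Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment and MT Bench Human Judgment) among open-source reward models.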
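The two kinds of numbers reported above, Pearson correlation with reference scores and accuracy on pairwise human preferences, can be computed as in the following sketch. The function names are hypothetical, and the tie handling in `pairwise_accuracy` (pairs with equal scores are skipped) is an assumption; the summary does not say how ties are treated.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def pairwise_accuracy(scores_a: Sequence[int], scores_b: Sequence[int],
                      human_prefers_a: Sequence[bool]) -> float:
    """Fraction of pairs where the higher-scored response matches the human
    preference. Ties are skipped (an assumption made for this sketch)."""
    decided = 0
    correct = 0
    for score_a, score_b, prefers_a in zip(scores_a, scores_b, human_prefers_a):
        if score_a == score_b:
            continue
        decided += 1
        if (score_a > score_b) == prefers_a:
            correct += 1
    return correct / decided if decided else 0.0
```

Used this way, an absolute grader doubles as a ranking model: each response in a pair is scored independently and the higher score is taken as the preferred one.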
Figure 2: The individual components of the Feedback Collection. By adding the appropriate reference materials (Score Rubric and Reference Answer) and training on GPT-4’s feedback, we show that we could obtain a strong open-source evaluator LM.


This review was generated by AI and reviewed by a human editor.