QUICK REVIEW

[Paper Review] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Seungone Kim, Jamin Shin|arXiv (Cornell University)|2023. 10. 12.

Topic Modeling | Citations: 15

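One-Line Summary

Prometheus is an open-source 13B evaluator LLM trained on the new Feedback Collection dataset. It performs fine-grained, rubric-based evaluation of long-form text, comes close to GPT-4 in correlation with human judgment, and outperforms open-source baseline models.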

ABSTRACT

Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as an universal reward model. We open-source our code, dataset, and model at https://kaistai.github.io/prometheus/.

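Research Motivation and Goals

Enable transparent, cost-efficient, and open evaluation by training an evaluator LLM that can handle thousands of custom score rubrics.
Build a dataset (the Feedback Collection) containing fine-grained rubrics, reference answers, and feedback, and use it to train the evaluator.
Show that fine-tuning with reference materials and feedback improves fine-grained evaluation capability.
Compare Prometheus against human evaluation, GPT-4 evaluation, and ranking-based reward model benchmarks to test its generality as a reward model.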

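Proposed Method

Construct the Feedback Collection with GPT-4: 1K fine-grained score rubrics, 20K instructions, and 100K responses with feedback.
Fine-tune Llama-2-Chat 13B (and 7B) on the Feedback Collection to obtain Prometheus.
Evaluate with Absolute Grading and Ranking Grading across multiple benchmarks and baselines.
Add reference materials (reference answer, score rubric) and a CoT-like feedback step to induce evaluation capability.
Compare correlation with human evaluators and with GPT-4, and assess feedback quality through human pairwise judgments.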
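
The input/output structure described above (instruction, response, score rubric, and reference answer in; feedback followed by a score out) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `EvalInstance` field names, the prompt layout, the 1 to 5 score range, and the `[RESULT]` marker are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EvalInstance:
    """One evaluation instance; field names are illustrative, not the dataset schema."""
    instruction: str
    response: str
    rubric: str            # criterion plus a description of each score level
    reference_answer: str  # an answer that would deserve the top score


def build_prompt(x: EvalInstance) -> str:
    """Concatenate the four inputs into one evaluator prompt (hypothetical layout)."""
    return (
        "Evaluate the response strictly against the score rubric.\n"
        "Write feedback first, then finish with '[RESULT] <score from 1 to 5>'.\n\n"
        f"Instruction:\n{x.instruction}\n\n"
        f"Response to evaluate:\n{x.response}\n\n"
        f"Reference answer:\n{x.reference_answer}\n\n"
        f"Score rubric:\n{x.rubric}\n\n"
        "Feedback:"
    )


def parse_output(text: str) -> Tuple[str, Optional[int]]:
    """Split evaluator output into (feedback, score).

    The score is None when no valid '[RESULT] n' marker (assumed format) is found.
    """
    match = re.search(r"\[RESULT\]\s*([1-5])\b", text)
    if match is None:
        return text.strip(), None
    return text[: match.start()].strip(), int(match.group(1))
```

Generating the feedback before the score mirrors the CoT-like step in the list above: the score is conditioned on the written rationale rather than produced in isolation.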

Figure 1: Compared to conventional, coarse-grained LLM evaluation, we propose a fine-grained approach that takes user-defined score rubrics as input.

Experimental Results

Research Questions

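RQ1: Can an open-source evaluator LLM achieve high correlation with human evaluation on custom rubrics?
RQ2: Does including reference materials and fine-grained rubrics improve evaluation capability beyond single-dimension benchmarks?
RQ3: How does Prometheus compare with GPT-4, GPT-3.5-Turbo, and other open-source baselines across multiple evaluation benchmarks?
RQ4: Does Prometheus work effectively as a reward model on ranking-based human preference datasets?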

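Key Results

Prometheus-13B reaches a Pearson correlation of 0.897 with human evaluators on 45 custom rubrics, comparable to GPT-4 (0.882).
Prometheus outperforms GPT-3.5-Turbo (0.392) and matches or exceeds several open-source baseline models across multiple benchmarks.
In pairwise feedback comparisons, human judges prefer Prometheus's feedback over GPT-4's in 58.67% of cases and over GPT-3.5-Turbo's in 79.57% of cases.
Across 1222 rubrics from four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval), Prometheus's correlation with GPT-4 scores follows the same trend, indicating strong alignment with GPT-4-style evaluation.
Prometheus achieves the highest accuracy among open-source reward models on two human preference benchmarks (HHH Alignment and MT Bench Human Judgment).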
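
The two kinds of metric reported above, Pearson correlation for absolute grading and accuracy for preference benchmarks, can be computed as in the sketch below. The function names are hypothetical, and counting ties as incorrect in `pairwise_accuracy` is an assumption of this example, not necessarily the paper's protocol.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between evaluator scores and reference (e.g. human) scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def pairwise_accuracy(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    human_prefers_a: Sequence[bool],
) -> float:
    """Share of pairs where the higher absolute score matches the human preference.

    Each response is scored independently; ties count as incorrect (assumption).
    """
    if not (len(scores_a) == len(scores_b) == len(human_prefers_a)):
        raise ValueError("all inputs must have the same length")
    if not human_prefers_a:
        raise ValueError("need at least one pair")
    correct = 0
    for a, b, prefers_a in zip(scores_a, scores_b, human_prefers_a):
        if a != b and (a > b) == prefers_a:
            correct += 1
    return correct / len(human_prefers_a)
```

For example, `pairwise_accuracy([5, 2, 3], [3, 4, 3], [True, False, True])` scores the first two pairs as correct and the tied third pair as incorrect.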

Figure 2: The individual components of the Feedback Collection. By adding the appropriate reference materials (Score Rubric and Reference Answer) and training on GPT-4’s feedback, we show that we could obtain a strong open-source evaluator LM.


This review was generated by AI and checked by a human editor.