QUICK REVIEW

[Paper Review] LLM Critics Help Catch LLM Bugs

Nat McAleese, Rai Michael Pokorny|arXiv (Cornell University)|Jun 28, 2024

Law, AI, and Intellectual Property | Cited by 8

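One-Sentence Summary

This paper trains LLM critics (CriticGPT) with RLHF to critique model-written code; the critics often catch more bugs than human reviewers, and pairing humans with critics reduces hallucinated bugs and nitpicks compared with critics alone.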

ABSTRACT

Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation this work trains "critic" models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.

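Motivation and Objectives

Address the fundamental limit that human evaluation ability places on RLHF for large language models.
Develop a scalable oversight mechanism by training LLM critics that generate natural-language critiques of code.
Compare CriticGPT critiques against human critiques and analyze the effect of human-machine collaboration.
Introduce an inference-time sampling method, Force Sampling Beam Search (FSBS), that balances critique comprehensiveness against the risk of hallucination.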

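Proposed Method

Train an autoregressive critic policy that takes a (question, answer) pair and outputs a plain-text critique.
Optimize the critic policy with RLHF (PPO), using a reward model trained on contractor ratings of critiques.
Introduce an adversarial tampering step in which contractors insert subtle bugs into code, producing high-quality training and evaluation data.
Apply Force Sampling Beam Search (FSBS) to constrain sampling and select critiques that balance length, number of highlighted points, and accuracy.
Evaluate critiques using contractor ratings of comprehensiveness, critique-bug inclusion (CBI), nitpicks, and overall helpfulness.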
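
The selection step of FSBS described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each sampled critique already carries a reward-model score and a count of highlighted code sections, and that candidates are ranked by that score plus a tunable bonus per highlight. The `Candidate` class, the `length_modifier` parameter, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str            # critique text
    rm_score: float      # hypothetical reward-model score for this critique
    num_highlights: int  # number of code sections the critique comments on


def select_critique(candidates, length_modifier):
    """Pick the candidate maximizing rm_score + length_modifier * num_highlights.

    Assumed ranking rule: a larger length_modifier favors longer, more
    comprehensive critiques; a smaller one favors shorter, more precise ones.
    """
    if not candidates:
        raise ValueError("need at least one candidate critique")
    return max(
        candidates,
        key=lambda c: c.rm_score + length_modifier * c.num_highlights,
    )


# Hypothetical candidates sampled for one (question, answer) pair.
candidates = [
    Candidate("short, precise critique", rm_score=2.0, num_highlights=1),
    Candidate("longer critique with more claims", rm_score=1.5, num_highlights=4),
]

precise = select_critique(candidates, length_modifier=0.0)
thorough = select_critique(candidates, length_modifier=0.5)
```

With these made-up numbers, a modifier of 0.0 selects the short critique and a modifier of 0.5 selects the longer one, which is the kind of dial between precision and comprehensiveness that the method list attributes to FSBS.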

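Experimental Results

Research Questions

RQ1: Can LLM critics trained with RLHF improve the accuracy and helpfulness of human evaluation of model-generated code?
RQ2: How do CriticGPT critiques compare with human and ChatGPT critiques at detecting inserted bugs?
RQ3: What is the trade-off between comprehensiveness and hallucination in LLM critiques, and can FSBS navigate it effectively?
RQ4: Do human-machine teams (Human+CriticGPT) produce higher-quality critiques than humans or critics alone?
RQ5: Does critic-guided evaluation generalize to non-code tasks and real-world data distributions?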

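Key Findings

CriticGPT critiques are clearly preferred over ChatGPT and human critiques on code with inserted bugs.
CriticGPT catches more inserted bugs than human contractors paid for code review.
Human-machine teams (Human+CriticGPT) write more comprehensive critiques than humans alone and hallucinate less than CriticGPT alone.
FSBS exposes a trade-off between comprehensiveness and hallucination, so an operating point can be chosen along the Pareto frontier of critique quality.
Critics trained on adversarially tampered data produce higher-quality critiques than critics trained without tampering.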
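
The Pareto frontier mentioned in the findings can be made concrete with a small sketch. It assumes each sampling setting is summarized by a (comprehensiveness, hallucination rate) pair, where higher comprehensiveness and a lower hallucination rate are better; the `pareto_front` helper and the numbers are hypothetical and are not results from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (comprehensiveness, hallucination_rate). A point is
    dominated if another point is at least as comprehensive and hallucinates
    no more, and is strictly better on at least one of the two.
    """
    front = []
    for i, (comp_i, hall_i) in enumerate(points):
        dominated = any(
            comp_j >= comp_i
            and hall_j <= hall_i
            and (comp_j > comp_i or hall_j < hall_i)
            for j, (comp_j, hall_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((comp_i, hall_i))
    return sorted(front)


# Hypothetical (comprehensiveness, hallucination_rate) per sampling setting.
settings = [(0.40, 0.05), (0.55, 0.10), (0.50, 0.15), (0.70, 0.25)]
front = pareto_front(settings)
```

In this toy data the setting (0.50, 0.15) is dropped because (0.55, 0.10) is both more comprehensive and less prone to hallucination; the remaining three settings are the candidate operating points.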


This review was generated by AI and reviewed by a human editor.