QUICK REVIEW

[Paper Digest] RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Shreyas Chaudhari, Pranjal Aggarwal|arXiv (Cornell University)|Apr 12, 2024
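Topic Modeling|Cited by 8
ONE-SENTENCE SUMMARY

This paper analyzes the foundations of RLHF for large language models, focusing on the reward model, its training, its limitations, and the implications of imperfect rewards within the RL framework.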

ABSTRACT

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.

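RESEARCH MOTIVATION AND OBJECTIVES

  • Motivate RLHF as a way to resolve the objective mismatch in pretrained language models.
  • Examine RLHF through a Bayesian lens to understand how reward modeling and human feedback fit together.
  • Analyze reward functions, their role and limitations in RLHF, and how reward models are trained.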

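PROPOSED APPROACH

  • Formalize text generation as a sequential decision-making process using a Markov decision process (MDP).
  • Examine how rewards are constructed, including high-information oracle rewards, human feedback, and the Bradley–Terry model for pairwise preferences.
  • Treat reward modeling as a regression problem and analyze the likelihood of the preference data.
  • Analyze how imperfect rewards and function approximation affect RLHF performance.
  • Provide a literature review of RLHF components and of alternatives to RL-based fine-tuning.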
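
The MDP view of text generation can be made concrete with a small sketch. It assumes a token-level formulation in which the state is the prompt plus the tokens generated so far, the action is the next token, and the transition simply appends that token. The names `State`, `step`, `rollout`, `toy_policy`, and `toy_reward` are hypothetical and exist only for illustration. The reward is given only at the end of the sequence, which also illustrates the sparsity of feedback the paper discusses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence token


@dataclass(frozen=True)
class State:
    """State = the prompt plus every token generated so far."""
    prompt: Tuple[str, ...]
    generated: Tuple[str, ...] = ()

    def is_terminal(self, max_len: int) -> bool:
        hit_eos = len(self.generated) > 0 and self.generated[-1] == EOS
        return hit_eos or len(self.generated) >= max_len


def step(state: State, action: str) -> State:
    """Deterministic transition: append the chosen token to the context."""
    return State(state.prompt, state.generated + (action,))


def rollout(
    policy: Callable[[State], str],
    reward_fn: Callable[[State], float],
    prompt: Sequence[str],
    max_len: int = 8,
) -> Tuple[State, List[float]]:
    """Generate one episode and record the per-step rewards."""
    state = State(tuple(prompt))
    rewards: List[float] = []
    while not state.is_terminal(max_len):
        state = step(state, policy(state))
        # Sparse feedback: only the completed sequence is scored.
        rewards.append(reward_fn(state) if state.is_terminal(max_len) else 0.0)
    return state, rewards


def toy_policy(state: State) -> str:
    """Stand-in for a language model: replays a fixed script, then EOS."""
    script = ["thanks", "for", "asking", EOS]
    i = len(state.generated)
    return script[i] if i < len(script) else EOS


def toy_reward(state: State) -> float:
    """Stand-in for a learned reward model scoring the full response."""
    return 1.0 if "thanks" in state.generated else 0.0


final_state, rewards = rollout(toy_policy, toy_reward, ["hello"])
print(final_state.generated, rewards)
```

In this toy episode every intermediate step receives a reward of zero and only the final step is scored, so an RL algorithm has to assign credit for that single number across all token choices.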

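EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: When inferring a reward model from human feedback, what form does Pr(D_HF | φ) take?
  • RQ2: How do reward modeling choices and imperfect rewards affect RLHF training and language model alignment?
  • RQ3: What limitations and generalization challenges arise for reward models trained from limited human feedback?
  • RQ4: How does RLHF fit into a broader framework for eliminating the objective mismatch in pretrained language models?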
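
For RQ1, one common concrete form can be written down if the pairwise preferences are assumed to follow the Bradley–Terry model and the comparisons are assumed independent. The notation here is an assumption and may differ from the paper's: $\mathcal{D}_{HF} = \{(x_i, y_i^{w}, y_i^{l})\}_{i=1}^{N}$ is a set of prompts with a preferred and a rejected response, and $r_\phi$ is the reward model with parameters $\phi$.

```latex
\Pr(\mathcal{D}_{HF} \mid \phi)
  = \prod_{i=1}^{N} \sigma\!\big( r_\phi(x_i, y_i^{w}) - r_\phi(x_i, y_i^{l}) \big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Under these assumptions, maximizing the likelihood is equivalent to minimizing the sum of $-\log \sigma(\cdot)$ over the comparison pairs.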

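KEY FINDINGS

  • The reward model is the core of RLHF, and its design choices set a fundamental upper bound on alignment.
  • Reward data is typically scarce, which leads to generalization problems and misgeneralization on unseen inputs.
  • Imperfect, potentially sparse, or misspecified rewards degrade the language model's performance and alignment.
  • The Bayesian interpretation frames reward learning as MAP estimation of the reward model parameters given the human feedback data.
  • The paper documents the limitations of current RLHF practice and surveys alternatives and extensions that go beyond RL with a reward model.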
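
The MAP view of reward learning can be illustrated with a minimal sketch. It assumes a linear reward model over hand-made feature vectors, a Bradley–Terry likelihood for the preference pairs, and a zero-mean Gaussian prior on the parameters, which turns into an L2 penalty. The feature meanings, the toy data, and all function names are hypothetical; none of this reproduces the paper's own setup.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Each pair holds (features of the preferred response, features of the rejected one).
Pair = Tuple[Vector, Vector]


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward(phi: Vector, features: Vector) -> float:
    """Linear reward model r_phi(features) = phi . features (an assumption)."""
    return sum(p * f for p, f in zip(phi, features))


def neg_log_posterior(phi: Vector, pairs: Sequence[Pair], prior_precision: float) -> float:
    """Bradley-Terry negative log-likelihood plus Gaussian prior penalty (up to a constant)."""
    nll = sum(softplus(-(reward(phi, win) - reward(phi, lose))) for win, lose in pairs)
    return nll + 0.5 * prior_precision * sum(p * p for p in phi)


def gradient(phi: Vector, pairs: Sequence[Pair], prior_precision: float) -> List[float]:
    """Gradient of neg_log_posterior with respect to phi."""
    grad = [prior_precision * p for p in phi]
    for win, lose in pairs:
        margin = reward(phi, win) - reward(phi, lose)
        coeff = sigmoid(margin) - 1.0  # derivative of -log(sigmoid(margin))
        for i in range(len(grad)):
            grad[i] += coeff * (win[i] - lose[i])
    return grad


def map_estimate(pairs: Sequence[Pair], dim: int, prior_precision: float = 1.0,
                 lr: float = 0.1, steps: int = 500) -> List[float]:
    """Plain gradient descent on the negative log-posterior, starting from zero."""
    phi = [0.0] * dim
    for _ in range(steps):
        g = gradient(phi, pairs, prior_precision)
        phi = [p - lr * gi for p, gi in zip(phi, g)]
    return phi


# Toy data with hypothetical features [helpfulness, toxicity].
pairs: List[Pair] = [
    ([1.0, 0.0], [0.0, 1.0]),  # helpful preferred over toxic
    ([1.0, 0.0], [0.0, 0.0]),  # helpful preferred over neutral
    ([0.0, 0.0], [0.0, 1.0]),  # neutral preferred over toxic
]
phi_map = map_estimate(pairs, dim=2)
print(phi_map)
```

With only three comparisons, the prior keeps the estimate finite even though the data are perfectly separable. Without it, the maximum-likelihood parameters would grow without bound, which is one small instance of how scarce feedback interacts with modeling choices.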

This digest was generated by AI and reviewed by a human editor.