QUICK REVIEW

[Paper Review] A Survey of Reinforcement Learning from Human Feedback

Timo Kaufmann, Paul Weng|arXiv (Cornell University)|Dec 22, 2023
Software Engineering Research | Cited by 34
One-Sentence Summary

This paper provides a comprehensive overview of RLHF, detailing fundamentals, feedback types, reward modeling, theory, applications, benchmarks, and future directions beyond just LLMs.

ABSTRACT

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning provides a promising approach to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The success in training large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF has played a decisive role in directing the model's capabilities towards human objectives. This article provides an overview of the fundamentals of RLHF, exploring how RL agents interact with human feedback. While recent focus has been on RLHF for LLMs, our survey covers the technique across multiple domains. We provide our most comprehensive coverage in control and robotics, where many fundamental techniques originate, alongside a dedicated LLM section. We examine the core principles that underpin RLHF, how algorithms and human feedback work together, and the main research trends in the field. Our goal is to give researchers and practitioners a clear understanding of this rapidly growing field.

Research Motivation and Objectives

  • Explain why human feedback is used to define and refine RL objectives.
  • Survey the taxonomy of RLHF approaches, especially reward modeling and interactive learning.
  • Summarize key methods, data collection, and evaluation practices in RLHF.
  • Synthesize theoretical insights and practical benchmarks to guide future research.

Proposed Method

  • Describe the RLHF framework with reward learning followed by RL training.
  • Map feedback types onto PbRL, semi-supervised RL (SSRL), and RLHF to build a taxonomy.
  • Discuss reward model training via Bradley-Terry style likelihood for trajectory comparisons.
  • Review active label collection, data efficiency techniques, and evaluation practices.
  • Summarize theoretical results linking policy learning to RLHF objectives.
  • Survey applications, libraries, and benchmarks in RLHF.
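
The Bradley-Terry style likelihood mentioned above can be written out as follows. This is a sketch in commonly used notation, not necessarily the survey's own: the segment symbols $\sigma^1, \sigma^2$, the reward model $r_\psi$, the label $y \in \{0, 1\}$, and the dataset $\mathcal{D}$ are assumptions introduced here for illustration.

```latex
% Probability that segment \sigma^1 is preferred to \sigma^2 under reward model r_\psi
P_\psi(\sigma^1 \succ \sigma^2)
  = \frac{\exp\left(\sum_t r_\psi(s^1_t, a^1_t)\right)}
         {\exp\left(\sum_t r_\psi(s^1_t, a^1_t)\right)
          + \exp\left(\sum_t r_\psi(s^2_t, a^2_t)\right)}

% Negative log-likelihood over labeled comparisons (y = 1 if \sigma^1 was preferred)
\mathcal{L}(\psi)
  = -\sum_{(\sigma^1, \sigma^2, y) \in \mathcal{D}}
      \left[ y \log P_\psi(\sigma^1 \succ \sigma^2)
           + (1 - y) \log P_\psi(\sigma^2 \succ \sigma^1) \right]
```

Minimizing $\mathcal{L}(\psi)$ is equivalent to logistic regression on the difference between the two segments' summed rewards, which is why reward learning from comparisons can reuse standard classification tooling.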

Experimental Results

Research Questions

  • RQ1: What are the main components and principles that define RLHF?
  • RQ2: How do different feedback types fit within PbRL, SSRL, and RLHF, and how do they influence reward modeling?
  • RQ3: What are the prevailing methods for reward learning from human feedback, and how are they evaluated?
  • RQ4: What theoretical guarantees or insights exist for RLHF, and how do they relate to standard RL?
  • RQ5: What are the current applications, benchmarks, and practical considerations in RLHF beyond LLMs?

Key Findings

  • RLHF generalizes PbRL by incorporating a broader set of feedback types beyond simple trajectory comparisons.
  • Reward modeling via probabilistic formulations like the Bradley-Terry model enables learning a reward function from human preferences.
  • RLHF is typically decomposed into two stages, reward learning and policy optimization, a separation that enables semi-supervised learning.
  • Active and interactive label collection, along with data augmentation and meta-learning, improve feedback efficiency and adaptability.
  • A growing body of theoretical work links RLHF to standard RL and offers insights into alignment and safety.
  • A wide range of applications, supporting libraries, and benchmarks exist, illustrating RLHF’s broad impact and practical relevance.
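
The two-stage decomposition and Bradley-Terry reward modeling described in the findings above can be illustrated with a small sketch. It is not code from the survey. It assumes a linear reward over hand-made feature vectors, replaces the human annotator with a simulated one driven by hidden `TRUE_WEIGHTS`, and uses a simple "pick the best candidate" step as a stand-in for RL policy optimization. All function and variable names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def segment_return(weights, segment):
    """Sum of linear rewards w . phi(s, a) over one trajectory segment."""
    return sum(sum(w * f for w, f in zip(weights, step)) for step in segment)


def preference_prob(weights, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred to seg_b."""
    return sigmoid(segment_return(weights, seg_a) - segment_return(weights, seg_b))


def fit_reward(comparisons, dim, lr=0.1, epochs=50):
    """Stage 1: stochastic gradient ascent on the preference log-likelihood."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for seg_a, seg_b, label in comparisons:
            # d(log-likelihood)/d(return gap) = label - predicted probability
            error = label - preference_prob(weights, seg_a, seg_b)
            for i in range(dim):
                feature_gap = sum(s[i] for s in seg_a) - sum(s[i] for s in seg_b)
                weights[i] += lr * error * feature_gap
    return weights


def random_segment(rng, length=3, dim=2):
    """A toy segment: `length` steps, each a `dim`-dimensional feature vector."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]


def simulated_label(seg_a, seg_b, true_weights):
    """Stand-in for a human annotator: prefers the higher true return."""
    better = segment_return(true_weights, seg_a) > segment_return(true_weights, seg_b)
    return 1 if better else 0


def make_comparisons(rng, n, true_weights):
    data = []
    for _ in range(n):
        a, b = random_segment(rng), random_segment(rng)
        data.append((a, b, simulated_label(a, b, true_weights)))
    return data


rng = random.Random(0)
TRUE_WEIGHTS = [1.0, -1.0]  # assumption: hidden from the learner, used only to simulate labels

train = make_comparisons(rng, 200, TRUE_WEIGHTS)
held_out = make_comparisons(rng, 200, TRUE_WEIGHTS)
learned_weights = fit_reward(train, dim=2)

# Fraction of held-out simulated preferences that the learned reward agrees with.
correct = sum(
    (preference_prob(learned_weights, a, b) > 0.5) == bool(y) for a, b, y in held_out
)
accuracy = correct / len(held_out)

# Stage 2 (stand-in for RL): choose the candidate with the highest learned return.
candidates = [random_segment(rng) for _ in range(20)]
best = max(candidates, key=lambda seg: segment_return(learned_weights, seg))
```

Because only comparisons are observed, the learned weights recover the direction of the hidden reward rather than its scale. That is sufficient for the second stage, since ranking candidates depends only on the ordering of returns.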


This review was generated by AI and reviewed by human editors.