[Paper Review] Deep reinforcement learning from human preferences
The paper trains policies by fitting a reward model to human preferences between trajectory segments and optimizing the predicted reward with RL, which allows complex tasks to be learned without access to the true reward. It demonstrates the approach on Atari and MuJoCo tasks with minimal human feedback.
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
Motivation & Objective
- Motivate reinforcement learning in domains with hard-to-specify rewards.
- Develop a scalable method to learn from human preferences rather than absolute rewards.
- Show that small amounts of non-expert human feedback can guide deep RL in large tasks.
- Demonstrate learned behaviors in Atari and MuJoCo for which rewards are difficult to hand-design.
Proposed method
- Maintain a policy π and a reward predictor r̂, both parameterized by deep neural networks.
- Collect trajectory segments and query humans to compare pairs of segments.
- Fit r̂ by minimizing a cross-entropy loss between the human preference labels and the preferences predicted by a Bradley–Terry-type model.
- Train the policy with RL using the predicted reward r̂ as the reward signal.
- Use an ensemble of reward predictors and average their outputs to stabilize learning.
- Select queries by sampling segment pairs and choosing those with high ensemble disagreement.
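The reward-fitting and query-selection steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the preference probability is a softmax over the summed predicted rewards of the two segments, the loss is the cross-entropy against the human label, and disagreement is measured as the variance of the predicted preference across ensemble members. The function names, the toy "segments" (lists of floats), and the scaled-identity ensemble members are all hypothetical and exist only to make the example runnable.

```python
import math
from statistics import pvariance


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry-style probability that segment A is preferred over B.

    rewards_a, rewards_b: per-step predicted rewards for each segment.
    P[A > B] = exp(sum_a) / (exp(sum_a) + exp(sum_b)) = sigmoid(sum_a - sum_b).
    """
    diff = sum(rewards_b) - sum(rewards_a)
    # Branch on the sign so that math.exp never receives a large positive value.
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def preference_loss(rewards_a, rewards_b, mu_a):
    """Cross-entropy between the human label and the predicted preference.

    mu_a is the label mass on segment A: 1.0 if A was preferred,
    0.0 if B was preferred, 0.5 if the two were judged equally good.
    """
    p_a = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(mu_a * math.log(p_a + eps) + (1.0 - mu_a) * math.log(1.0 - p_a + eps))


def select_queries(candidate_pairs, ensemble, num_queries):
    """Return the pairs on which the ensemble disagrees most.

    ensemble: list of callables mapping a segment to its per-step predicted
    rewards (a hypothetical interface chosen for this sketch).
    """
    def disagreement(pair):
        seg_a, seg_b = pair
        probs = [preference_prob(member(seg_a), member(seg_b)) for member in ensemble]
        return pvariance(probs)

    return sorted(candidate_pairs, key=disagreement, reverse=True)[:num_queries]


def make_member(scale):
    """Toy ensemble member: treats each observation as a reward, rescaled."""
    return lambda segment: [scale * x for x in segment]


toy_ensemble = [make_member(s) for s in (0.5, 1.0, 2.0)]
toy_pairs = [
    ([1.0, 1.0], [1.0, 1.0]),  # every member predicts 0.5: no disagreement
    ([1.0, 0.0], [0.0, 0.0]),  # members give noticeably different probabilities
    ([3.0, 3.0], [0.0, 0.0]),  # every member is confident: little disagreement
]
chosen = select_queries(toy_pairs, toy_ensemble, num_queries=1)
```

In a real system the per-step rewards would come from neural networks and the loss would be minimized by gradient descent; the sketch only shows which quantities are compared and how a query is ranked.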
Experimental results
Research questions
- RQ1: Can human preferences over short trajectory clips provide enough signal to train a deep RL agent without a native reward function?
- RQ2: How much and what type of human feedback (real vs. synthetic/oracle) is needed to achieve near-RL performance on complex tasks?
- RQ3: Does online human feedback prevent reward mis-specification and exploitation by the agent?
- RQ4: Can the approach scale to complex domains (Atari, MuJoCo) and yield novel behaviors not easily hand-crafted in rewards?
Key findings
- The approach enables solving complex RL tasks in Atari and MuJoCo with much less human time than full demonstrations or reward engineering.
- With hundreds to thousands of human comparisons, the method nearly matches the performance of RL trained on the true reward on several MuJoCo tasks and some Atari games.
- Real human feedback often performs similarly to or slightly worse than synthetic feedback, depending on the task and labeling consistency.
- The method can learn novel behaviors (e.g., backflips, driving with traffic) in under an hour of human time.
- Offline reward predictor training without online updates can fail, showing the importance of integrating human feedback with ongoing RL.
- Using an ensemble for r̂ and comparing trajectory clips improves learning stability and alignment with human judgments.
This review was created by AI and reviewed by human editors.