QUICK REVIEW

[Paper Review] Trial without Error: Towards Safe Reinforcement Learning via Human Intervention

William S. Saunders, Girish Sastry | arXiv (Cornell University) | Jul 17, 2017
Reinforcement Learning in Robotics | 17 references | 110 citations
TL;DR

The paper formalizes Human Intervention RL (HIRL), which prevents catastrophes during training by having a human block catastrophic actions, and then trains a supervised Blocker to imitate the human's blocking decisions and take over. Evaluated on Atari games, HIRL achieves zero catastrophes in Pong and Space Invaders but only reduces them in Road Runner; the authors also explain why the current implementation would not scale to harder environments.

ABSTRACT

AI systems are increasingly applied to complex tasks that involve interaction with humans. During training, such systems are potentially dangerous, as they haven't yet learned to avoid actions that could cause serious harm. How can an AI system explore and learn without making a single mistake that harms humans or otherwise causes serious damage? For model-free reinforcement learning, having a human "in the loop" and ready to intervene is currently the only way to prevent all catastrophes. We formalize human intervention for RL and show how to reduce the human labor required by training a supervised learner to imitate the human's intervention decisions. We evaluate this scheme on Atari games, with a Deep RL agent being overseen by a human for four hours. When the class of catastrophes is simple, we are able to prevent all catastrophes without affecting the agent's learning (whereas an RL baseline fails due to catastrophic forgetting). However, this scheme is less successful when catastrophes are more complex: it reduces but does not eliminate catastrophes and the supervised learner fails on adversarial examples found by the agent. Extrapolating to more challenging environments, we show that our implementation would not scale (due to the infeasible amount of human labor required). We outline extensions of the scheme that are necessary if we are to train model-free agents without a single catastrophe.

Motivation & Objective

  • Define a formal safety framework for model-free RL with human oversight to prevent catastrophes during training.
  • Propose HIRL: a human-in-the-loop scheme where a Blocker learns to imitate the human's blocking decisions to replace unsafe actions.
  • Evaluate HIRL on Atari games to assess safety performance and learning efficiency across agents.
  • Highlight scalability challenges and outline strategies to reduce human labor while maintaining zero-catastrophe safety where possible.

Proposed method

  • Model RL as an MDP and introduce a Human Oversight phase where a human blocks catastrophic actions and replaces them with safe actions.
  • Collect state-action data and labels on whether the human blocked, to train a Blocker classifier that imitates blocking decisions.
  • Once the Blocker performs well enough on held-out data, retire the human and let the Blocker supervise; the Blocker also handles action replacement.
  • Use a CNN-based Blocker trained on raw Atari frames to achieve low false-negative rates for catastrophes.
  • Compare HIRL against a reward-shaping baseline that penalizes catastrophes without blocking actions.
  • Analyze robustness to distribution shift and adversarial examples, and discuss data-efficiency and human-time costs.
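
The steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-action toy environment, the rule standing in for the human, and the threshold "classifier" standing in for the CNN Blocker are all assumptions made for illustration, as are the names `supervised_action`, `human_should_block`, and `train_blocker`.

```python
from typing import Callable, List, Tuple

UP, DOWN = 0, 1  # hypothetical two-action toy environment

# A block decision maps (state, proposed action) to True if the action must be blocked.
BlockFn = Callable[[int, int], bool]


def supervised_action(state: int, proposed: int, should_block: BlockFn,
                      safe_action: int) -> Tuple[int, bool]:
    """Return the action actually executed and whether the proposal was blocked."""
    blocked = should_block(state, proposed)
    return (safe_action if blocked else proposed), blocked


def human_should_block(state: int, action: int) -> bool:
    """Stand-in for the human overseer (assumption): DOWN near the bottom is catastrophic."""
    return action == DOWN and state <= 2


def collect_oversight_data(proposals) -> List[Tuple[int, int, bool]]:
    """Human Oversight phase: log every (state, action, blocked) triple."""
    data = []
    for state, action in proposals:
        _, blocked = supervised_action(state, action, human_should_block, UP)
        data.append((state, action, blocked))
    return data


def train_blocker(data) -> BlockFn:
    """Toy stand-in for the CNN classifier: learn the highest state where DOWN was blocked."""
    blocked_states = [s for s, a, b in data if b and a == DOWN]
    threshold = max(blocked_states) if blocked_states else -1
    return lambda state, action: action == DOWN and state <= threshold


# Oversight on some states; the remaining states are held out for evaluation.
train_proposals = [(s, a) for s in (0, 2, 3, 5) for a in (UP, DOWN)]
held_out = [(s, a) for s in (1, 4) for a in (UP, DOWN)]

blocker = train_blocker(collect_oversight_data(train_proposals))

# Retire the human only if the Blocker agrees with the human on held-out pairs.
agreement = all(blocker(s, a) == human_should_block(s, a) for s, a in held_out)

# With the human retired, the Blocker supervises and replaces unsafe actions.
executed, was_blocked = supervised_action(1, DOWN, blocker, UP)
```

In this toy setting the Blocker matches the human on the held-out pairs, so the unsafe proposal `DOWN` in state 1 is replaced by `UP`. A real Blocker operating on raw frames would be an imperfect classifier, which is why the false-negative rate matters.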

Experimental results

Research questions

  • RQ1: Can human intervention prevent all catastrophic actions during RL training across simple and complex catastrophe classes?
  • RQ2: How well can a learned Blocker imitate human interventions and scale across different RL agents and environments?
  • RQ3: What are the human time costs and scalability limits when applying HIRL to more complex tasks?
  • RQ4: What extensions are needed to reduce human labor while maintaining zero-catastrophe learning in safe RL?

Key findings

  • HIRL achieved zero catastrophes in Pong and Space Invaders; in Road Runner it reduced catastrophes by 50x but did not eliminate them.
  • The Blocker transfers across agents and architectures, blocking catastrophes without hindering learning in Pong.
  • Reward Shaping with large negative penalties failed to prevent all catastrophes due to catastrophic forgetting and adversarial exploitation.
  • Extrapolations suggest the current HIRL setup would be infeasible for longer or more complex tasks due to high human-time costs.
  • Blocker robustness can be compromised by adversarial agents, necessitating data-efficiency and active-learning strategies.
  • In Pong, catastrophes can be locally avoided but non-local catastrophes reveal limitations of blocking alone.
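
The human-time finding above can be made concrete with a back-of-envelope sketch. This is not the paper's cost model: it simply assumes the human must watch frames until enough catastrophes have been seen to train the Blocker, and every number below is hypothetical.

```python
def frames_to_label(catastrophes_needed: int, frames_per_catastrophe: int) -> int:
    """Frames the human must oversee to observe the required number of catastrophes."""
    return catastrophes_needed * frames_per_catastrophe


def oversight_hours(frames: int, seconds_per_frame: float) -> float:
    """Total human time in hours at a fixed labeling speed."""
    return frames * seconds_per_frame / 3600.0


# Hypothetical inputs: 100 catastrophes needed, one second of human time per frame.
common = frames_to_label(catastrophes_needed=100, frames_per_catastrophe=100)
rare = frames_to_label(catastrophes_needed=100, frames_per_catastrophe=10_000)

hours_common = oversight_hours(common, seconds_per_frame=1.0)
hours_rare = oversight_hours(rare, seconds_per_frame=1.0)
```

Under these assumptions, making catastrophes 100 times rarer makes the oversight phase 100 times longer, which illustrates why human labor becomes the bottleneck in environments where catastrophes are rare or hard to characterize.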

This review was created by AI and reviewed by human editors.