QUICK REVIEW

[Paper Review] Interactive Learning from Policy-Dependent Human Feedback

James MacGlashan, Mark K. Ho | arXiv (Cornell University) | Jan 21, 2017

Reinforcement Learning in Robotics | 24 references | 108 citations

TL;DR

The paper shows that human feedback depends on the learner's current policy and introduces COACH, an actor-critic-based algorithm that converges when learning from policy-dependent feedback, demonstrated in simulations and on a TurtleBot robot.

ABSTRACT

This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy. We present empirical results that show this assumption to be false -- whether human trainers give a positive or negative feedback for a decision is influenced by the learner's current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.

Motivation & Objective

Demonstrate that human-provided feedback varies with the learner's current policy (policy-dependent feedback) and not just action quality.
Develop and formalize an algorithm (COACH) that learns from policy-dependent feedback and converges to a local optimum.
Validate COACH in both simulated domains and real-robot experiments to show scalability and robustness across tasks.

Proposed method

Introduce the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ as the model of human feedback.
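Derive the update rule $\Delta\theta_t \propto \nabla_\theta \pi(s_t,a_t)\, f_t / \pi(s_t,a_t)$, which follows the policy gradient, and so converges to a local optimum, when the feedback $f_t$ matches $Q^\pi(s_t,a_t)$ or the advantage $A^\pi(s_t,a_t)$.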
Present Real-time COACH with reward aggregation and eligibility traces to handle variable feedback magnitude, timing delay, and sparse feedback.
Maintain several eligibility traces with different decay rates (λ) so that each piece of feedback is credited to the relevant past actions.
Compare COACH to Q-learning and TAMER in controlled domains to assess robustness to different feedback strategies.
Demonstrate Real-time COACH on a TurtleBot with five learned behaviors using differential and diminishing feedback.
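
The update rule and eligibility trace listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a tabular softmax policy, keeps a single trace rather than several with different decay rates, and omits reward aggregation and feedback delay. The names `CoachSketch`, `observe_action`, `give_feedback`, `alpha`, and `lam` are hypothetical. For a softmax policy, the term ∇θπ/π equals ∇θ log π, which is what the trace accumulates.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


class CoachSketch:
    """Hypothetical tabular learner using a COACH-style update (illustration only)."""

    def __init__(self, n_states, n_actions, alpha=0.5, lam=0.9):
        self.theta = [[0.0] * n_actions for _ in range(n_states)]  # policy parameters
        self.trace = [[0.0] * n_actions for _ in range(n_states)]  # eligibility trace
        self.alpha = alpha  # step size (assumed value)
        self.lam = lam      # trace decay rate (assumed value)

    def policy(self, s):
        return softmax(self.theta[s])

    def observe_action(self, s, a):
        """Decay the trace, then add grad(pi(s,a)) / pi(s,a) for the action taken."""
        pi = self.policy(s)
        for row in self.trace:
            for b in range(len(row)):
                row[b] *= self.lam
        # For softmax: d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s)
        for b, p in enumerate(pi):
            self.trace[s][b] += (1.0 if b == a else 0.0) - p

    def give_feedback(self, f):
        """Apply the human feedback f to every parameter through the trace."""
        for s, row in enumerate(self.trace):
            for b, e in enumerate(row):
                self.theta[s][b] += self.alpha * f * e


# Usage: one state, two actions; the trainer rewards action 1 once.
agent = CoachSketch(n_states=1, n_actions=2)
agent.observe_action(0, 1)
agent.give_feedback(+1.0)
print(agent.policy(0))  # probability of action 1 is now above 0.5
```

Because the trace decays by `lam` on every step, feedback that arrives a few steps late still reaches earlier actions, only with less weight.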

Experimental results

Research questions

RQ1: Does human feedback depend on the learner's current policy (policy-dependent feedback) in interactive learning settings?
RQ2: Can an actor-critic framework be designed to converge when trained with policy-dependent feedback (COACH)?
RQ3: How does COACH perform relative to existing HCRL approaches (e.g., TAMER) under various feedback strategies?
RQ4: Is COACH scalable to real-robot domains with high-frequency decisions and perceptual noise?
RQ5: What are practical considerations (delay, sparsity, reward magnitude) for real-time policy-dependent feedback?

Key findings

Human trainers provide feedback whose sign and magnitude depend on the learner's policy, not solely on action quality.
COACH converges to a local optimum when using policy-dependent feedback by leveraging the advantage-based feedback model.
In simulations, COACH outperforms alternatives under improvement-based feedback, while TAMER performs best with action-based feedback and can fail under certain strategies.
Real-time COACH teaches a TurtleBot five distinct behaviors within two minutes, using differential and diminishing feedback.
TAMER can forget previously learned behaviors under some compositional training and lure scenarios, whereas COACH maintains stable learning with policy-dependent feedback.
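
To make the first finding concrete, the sketch below shows how an advantage-style feedback signal can flip sign for the same action as the learner's policy improves. It is an illustration under stated assumptions, not data from the paper: the single-state task, the action names, the Q-values, and both policies are invented, and the Q-values are treated as fixed across policies, which holds only because the example is one-step.

```python
def advantage(q_values, policy, action):
    """A(s, a) = Q(s, a) - V(s), where V(s) is the policy-weighted mean of Q."""
    v = sum(policy[a] * q_values[a] for a in q_values)
    return q_values[action] - v


# Invented one-step task: three actions with fixed, made-up Q-values.
q = {"bad": 0.0, "ok": 1.0, "good": 2.0}

# Two hypothetical learners: one mostly picks "bad", the other mostly picks "good".
novice = {"bad": 0.8, "ok": 0.1, "good": 0.1}
expert = {"bad": 0.1, "ok": 0.1, "good": 0.8}

adv_novice = advantage(q, novice, "ok")  # positive: "ok" beats the novice's usual choice
adv_expert = advantage(q, expert, "ok")  # negative: "ok" is worse than the expert's usual choice

print(adv_novice, adv_expert)
```

Under this model, a trainer would praise the "ok" action from a poor learner and criticize the same action from a strong one, which is the kind of policy dependence the finding describes.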

This review was created by AI and reviewed by human editors.