QUICK REVIEW

[Paper Review] Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning

Noah Siegel, Jost Tobias Springenberg|arXiv (Cornell University)|Feb 19, 2020

Reinforcement Learning in Robotics|33 references|48 citations

TL;DR

The paper introduces an advantage-weighted behavior model (ABM) prior to stabilize offline reinforcement learning by biasing the policy toward actions seen in data that are likely to succeed on the current task, enabling stable learning from heterogeneous data sources.

ABSTRACT

Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch-RL that enables stable learning from conflicting data-sources. We find improvements on competitive baselines in a variety of RL tasks -- including standard continuous control benchmarks and multi-task learning for simulated and real-world robots.

Motivation & Objective

Learn from fixed batch data when online interaction is impossible or costly, especially in robotics.
Develop a method that leverages arbitrary behavior data while avoiding actions not supported by the data.
Stabilize policy improvement by constraining updates to stay close to a learned data-driven prior.
Demonstrate improved stability and performance across continuous control benchmarks and multi-task robotics tasks.

Proposed method

Propose a policy iteration framework where the policy is improved under a constraint that keeps it close to a learned prior.
Learn a prior policy, either as a simple behavior model (BM) or as an advantage-weighted behavior model (ABM) that emphasizes data-supported, task-relevant actions.
Evaluate Q by minimizing a TD error whose value target V is an expectation under the current policy, avoiding the max over actions in the offline setting.
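In the policy improvement step, maximize expected Q under a KL constraint against the prior: $\mathbb{E}_{\tau}\big[\mathbb{E}_{a \sim \pi(\cdot|s)}[\hat{Q}^{\pi_i}(s, a)]\big]$ subject to $\mathrm{KL}\big(\pi(\cdot|s) \,\|\, \pi_{\mathrm{prior}}(\cdot|s)\big) \le \epsilon$.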
Solve the constrained objective with either an EM-style optimization (MPO-inspired) or stochastic value gradient optimization.
The ABM objective weights data snippets by a function of their realized advantage $R(\tau_{t:N}) - \hat{V}^{\pi_i}(s_t)$, focusing on beneficial actions while staying within data support.
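
The steps above can be illustrated with a small NumPy sketch. It is a toy rendering under stated assumptions, not the authors' implementation: it uses a discrete action space (the paper targets continuous control), an indicator function as the advantage weighting (the review only says "a function of their realized advantage"), and a fixed temperature in place of solving for the multiplier that enforces the KL bound ε. All function names are hypothetical.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return-to-go R(tau_{t:N}) for every step t of one logged trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def abm_weights(returns, values):
    """Weight each logged step by f(R - V).

    Assumption: f is an indicator (1 if the realized advantage is
    non-negative, else 0). Other monotone choices are possible.
    """
    return (returns - values >= 0.0).astype(float)


def weighted_prior_loss(log_probs, weights):
    """Negative advantage-weighted log-likelihood of the logged actions.

    log_probs[t] is log pi_prior(a_t | s_t). With all weights equal to 1
    this reduces to plain behavior cloning (the BM prior).
    """
    return -np.mean(weights * log_probs)


def expected_value_target(reward, next_q, next_pi, gamma):
    """TD target r + gamma * E_{a' ~ pi}[Q(s', a')], with no max over actions."""
    return reward + gamma * np.dot(next_pi, next_q)


def kl_regularized_policy(prior, q_values, temperature):
    """Improved policy pi(a|s) proportional to prior(a|s) * exp(Q(s,a) / temperature).

    This is the closed-form maximizer of E_pi[Q] - temperature * KL(pi || prior),
    i.e. the soft (Lagrangian) form of the KL-constrained step. Assumption:
    the temperature is fixed by hand instead of being tuned to meet epsilon.
    """
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        logits = np.log(prior) + q_values / temperature
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Calling `kl_regularized_policy` with a prior that assigns zero probability to some action leaves that action at zero probability regardless of its Q-value, which is the sense in which the prior keeps the improved policy inside the data support. Setting every weight in `weighted_prior_loss` to 1 recovers the plain BM prior, so the only difference between BM and ABM in this sketch is the output of `abm_weights`.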

Experimental results

Research questions

RQ1: Can an adaptive, data-driven prior enable stable offline RL from mixed behavior data and multiple tasks?
RQ2: Does constraining policy improvement to a learned prior prevent overestimation and extrapolation errors in fixed batch RL?
RQ3: How does an advantage-weighted prior (ABM) compare to a plain behavior model prior in handling conflicting or multimodal data?
RQ4: Can the proposed approach achieve multi-task learning and transfer in robotic manipulation from offline data?
RQ5: Is the policy evaluation step sufficient to stabilize learning when a policy iteration scheme is used with offline data?

Key findings

The ABM prior enables stable learning from batch data and improves performance on continuous control benchmarks compared to strong offline baselines.
BM priors help in simple domains, but ABM better handles conflicting data and multimodal behavior as seen in Hopper and Quadruped tasks.
ABM-enhanced methods achieve competitive or superior results to BEAR and BCQ baselines on control-suite tasks and multi-task robotic manipulation in simulation.
ABM+MPO can also learn new tasks from data containing related trajectories, and can re-learn seven tasks on a real Sawyer robot from logged data in reduced time.
Using ABM with offline MPO yields improvements across both simulated and real-world robotic experiments, including multi-task learning and data-driven task transfer.

This review was created by AI and reviewed by human editors.