QUICK REVIEW

[Paper Review] Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar|arXiv (Cornell University)|Oct 1, 2019

Reinforcement Learning in Robotics|References: 29|Citations: 165

One-Sentence Summary

AWR is a simple off-policy RL algorithm that combines two supervised learning steps over experience replay data (value regression and advantage-weighted policy regression), and it achieves competitive results on OpenAI Gym benchmarks and complex motion imitation tasks.

ABSTRACT

In this paper, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines. Our goal is an algorithm that utilizes only simple and convergent maximum likelihood loss functions, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. The method is simple and general, can accommodate continuous and discrete actions, and can be implemented in just a few lines of code on top of standard supervised learning methods. We provide a theoretical motivation for AWR and analyze its properties when incorporating off-policy data from experience replay. We evaluate AWR on a suite of standard OpenAI Gym benchmark tasks, and show that it achieves competitive performance compared to a number of well-established state-of-the-art RL algorithms. AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. Furthermore, we demonstrate our algorithm on challenging continuous control tasks with highly complex simulated characters.

Motivation and Objectives

Develop a simple, scalable off-policy RL algorithm based on supervised learning losses.
Enable learning from off-policy data via experience replay with a stable, bounded update.
Demonstrate competitive performance against established on-policy and off-policy methods on standard benchmarks and motion imitation tasks.

Proposed Method

Two-stage supervised learning update: fit a value function by regression on returns, then fit the policy via weighted regression using exp(A/β) as weights.
Advantage A(s,a) = R(s,a) - V(s) guides the policy update.
Derivation treats AWR as a constrained policy search optimizing expected improvement with a KL-divergence constraint.
Extend to off-policy data by modeling the sampling policy as a mixture of past policies via a replay buffer.
Use TD(λ) for low-variance return estimates and clip weights to stabilize training.
Optional baseline V̄(s) computed as a weighted average of past value functions for multiple policies.
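
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`td_lambda_returns`, `awr_weights`, `value_loss`, `policy_loss`) are hypothetical, the default values for `gamma`, `lam`, `beta`, and `max_weight` are placeholders rather than the paper's settings, and the policy's log-probabilities are assumed to come from whatever model is being trained.

```python
import numpy as np


def td_lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """TD(lambda) return targets for a single trajectory.

    rewards: shape (T,), reward received at each step.
    values:  shape (T + 1,), value estimates V(s_0), ..., V(s_T).
             values[T] is the bootstrap value (0.0 if the episode ended).
    """
    T = len(rewards)
    returns = np.zeros(T)
    next_return = values[T]
    for t in reversed(range(T)):
        # R_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * R_{t+1})
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


def awr_weights(returns, values, beta=1.0, max_weight=20.0):
    """Clipped exponential advantage weights, min(exp(A / beta), max_weight)."""
    advantages = returns - values  # A(s, a) = R(s, a) - V(s)
    return np.minimum(np.exp(advantages / beta), max_weight)


def value_loss(returns, value_preds):
    """Step 1: squared-error regression of V(s) onto return targets."""
    return 0.5 * np.mean((returns - value_preds) ** 2)


def policy_loss(log_probs, weights):
    """Step 2: negative weighted log-likelihood of the sampled actions.

    The weights are treated as fixed regression weights, so no gradient
    would flow through them in an autodiff framework.
    """
    return -np.mean(weights * log_probs)


# Toy trajectory (made-up numbers, for illustration only).
rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.0])
returns = td_lambda_returns(rewards, values, gamma=1.0, lam=1.0)
weights = awr_weights(returns, values[:-1], beta=1.0)
print(returns, weights)
```

In a full training loop, both losses would be minimized with a standard optimizer on minibatches sampled from the replay buffer; a smaller `beta` concentrates the policy update on high-advantage actions, which is why the weight clip matters for stability.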

Experimental Results

Research Questions

RQ1: Can a simple, regression-based off-policy RL algorithm achieve competitive performance with minimal complexity?
RQ2: How does incorporating a baseline and experience replay affect stability and sample efficiency?
RQ3: What is the impact of replay buffer size and off-policy data on learning quality?
RQ4: Can AWR scale to high-dimensional continuous control and motion imitation tasks?

Key Findings

Task	TRPO	PPO	DDPG	TD3	SAC	RWR	AWR (Ours)
Ant-v2	2901±85	1161±389	72±1550	4285±671	5909±371	181±19	5067±256
HalfCheetah-v2	3302±428	4920±429	10563±382	4309±1238	9297±1206	1400±370	9136±184
Hopper-v2	1880±337	1391±304	855±282	935±489	2769±552	605±114	3405±121
Humanoid-v2	552±9	695±59	4382±423	81±17	8048±700	509±18	4996±697
LunarLander-v2	104±94	121±49	-	-	-	185±23	229±2
Walker2d-v2	2765±168	2617±362	401±470	4212±427	5805±587	406±64	5813±483

AWR achieves competitive results compared to popular on-policy and off-policy methods on OpenAI Gym benchmarks.
AWR significantly outperforms purely on-policy methods like PPO and TRPO in both sample efficiency and asymptotic performance.
AWR attains similar asymptotic performance to SAC and TD3 on many tasks, despite using simple supervised regression for both value and policy updates.
The baseline V(s) and experience replay are crucial; removing them degrades performance, and larger replay buffers improve stability and final performance.
AWR effectively handles fully off-policy learning from static datasets in motion imitation tasks, matching or surpassing RWR and PPO under certain conditions.
On the challenging Humanoid-v2 task, AWR still lags behind SAC, indicating room for improvement on very difficult tasks.
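
The comparisons above can be checked directly against the results table. The sketch below copies the mean returns from the table (the ± spread is dropped, and blank entries become `None`); the helper names `best_algorithm` and `awr_to_sac_ratio` are hypothetical and exist only for this illustration.

```python
# Mean returns copied from the results table; the ± spread is omitted.
# None marks entries that the table leaves blank ("-").
ALGORITHMS = ["TRPO", "PPO", "DDPG", "TD3", "SAC", "RWR", "AWR"]
MEAN_RETURNS = {
    "Ant-v2": [2901, 1161, 72, 4285, 5909, 181, 5067],
    "HalfCheetah-v2": [3302, 4920, 10563, 4309, 9297, 1400, 9136],
    "Hopper-v2": [1880, 1391, 855, 935, 2769, 605, 3405],
    "Humanoid-v2": [552, 695, 4382, 81, 8048, 509, 4996],
    "LunarLander-v2": [104, 121, None, None, None, 185, 229],
    "Walker2d-v2": [2765, 2617, 401, 4212, 5805, 406, 5813],
}


def best_algorithm(task):
    """Return the algorithm with the highest reported mean return on a task."""
    scored = [
        (score, name)
        for name, score in zip(ALGORITHMS, MEAN_RETURNS[task])
        if score is not None
    ]
    return max(scored)[1]


def awr_to_sac_ratio(task):
    """Return AWR's mean return divided by SAC's, or None if SAC is missing."""
    row = dict(zip(ALGORITHMS, MEAN_RETURNS[task]))
    if row["SAC"] is None:
        return None
    return row["AWR"] / row["SAC"]


for task in MEAN_RETURNS:
    print(task, best_algorithm(task), awr_to_sac_ratio(task))
```

By mean return alone, AWR ranks first on Hopper-v2, LunarLander-v2, and Walker2d-v2, although the Walker2d-v2 margin over SAC (5813 vs. 5805) is far smaller than the reported spread. On Humanoid-v2 the AWR mean is roughly 62% of SAC's.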

This review was generated by AI and checked by a human editor.