QUICK REVIEW

[Paper Review] Wasserstein Robust Reinforcement Learning

Mohammed Amin Abdullah, Hang Ren|arXiv (Cornell University)|Jul 30, 2019

Reinforcement Learning in Robotics | 46 references | 37 citations

TL;DR

WR2L formulates robust reinforcement learning as a min–max game with an epsilon-Wasserstein constraint around a reference dynamics, and provides a scalable zero-order solver for high-dimensional, continuous tasks.

ABSTRACT

Reinforcement learning algorithms, though successful, tend to over-fit to training environments hampering their application to the real-world. This paper proposes $\text{WR}^{2}\text{L}$ -- a robust reinforcement learning algorithm with significant robust performance on low and high-dimensional control tasks. Our method formalises robust reinforcement learning as a novel min-max game with a Wasserstein constraint for a correct and convergent solver. Apart from the formulation, we also propose an efficient and scalable solver following a novel zero-order optimisation method that we believe can be useful to numerical optimisation in general. We empirically demonstrate significant gains compared to standard and robust state-of-the-art algorithms on high-dimensional MuJoCo environments.

Motivation & Objective

Motivate robustness in RL to improve generalisation when transition dynamics vary.
Introduce WR2L as a generic min–max framework with Wasserstein constraints.
Enable robustness in continuous state-action spaces without hand-crafted disturbance models.
Provide a scalable solver that alternates between updating dynamics and policy.

Proposed method

Define a robust RL objective as max_theta min_phi E_tau~p_theta^phi[R_total(tau)].
Constrain allowed transition perturbations to an epsilon-Wasserstein ball around a reference dynamics P0.
Parametrize policy pi_theta and perturbation dynamics phi; solve via alternating optimization.
Use average (rather than pointwise) Wasserstein constraints to make the constraint tractable.
Develop a second-order Taylor-based Hessian approximation to efficiently update phi within the constraint.
Present a zero-order (gradient-free) method for updating dynamics when gradients are unavailable.
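
The constrained dynamics update listed above can be illustrated with a minimal sketch. It assumes the expected return is linearised around the reference parameters phi0 with gradient g, and that the averaged Wasserstein constraint is replaced by its second-order form 0.5 (phi - phi0)^T H (phi - phi0) <= epsilon with H symmetric positive definite. Under those assumptions the minimiser has a closed form that lies on the constraint boundary. The function name and all numbers are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def constrained_dynamics_step(phi0, g, H, eps):
    """Minimise g @ d subject to 0.5 * d @ H @ d <= eps, where d = phi - phi0.

    Hypothetical helper for illustration; assumes H is symmetric positive definite.
    """
    h_inv_g = np.linalg.solve(H, g)             # H^{-1} g without forming the inverse
    scale = np.sqrt(2.0 * eps / (g @ h_inv_g))  # step length that hits the boundary
    return phi0 - scale * h_inv_g

# Toy numbers (assumed): two dynamics parameters, e.g. two link masses.
phi0 = np.array([1.0, 2.0])
g = np.array([0.5, -1.0])                  # gradient of expected return w.r.t. phi
H = np.array([[2.0, 0.0], [0.0, 0.5]])     # Hessian of the Wasserstein term at phi0
eps = 0.1

phi_adv = constrained_dynamics_step(phi0, g, H, eps)
d = phi_adv - phi0
print("adversarial parameters:", phi_adv)
print("constraint value:", 0.5 * d @ H @ d)
```

In an alternating scheme, a step like this would be followed by a policy update against the perturbed dynamics phi_adv, using whichever policy optimiser is already in place.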

Experimental results

Research questions

RQ1: How can we formulate robust RL to handle model perturbations in continuous state-action spaces?
RQ2: Can Wasserstein distance provide a principled, geometry-aware robustness constraint for RL transitions?
RQ3: Is it possible to efficiently solve the resulting min–max problem without explicit dynamics models?
RQ4: Does the proposed WR2L framework improve robustness and performance on high-dimensional control tasks?

Key findings

WR2L achieves significant robust performance improvements on high-dimensional MuJoCo environments compared to standard and some robust baselines.
The algorithm accommodates both discrete and continuous state-action spaces within a unified Wasserstein-based framework.
A novel zero-order optimization method enables scalable updates to transition dynamics without requiring gradient information.
The Hessian-based constraint approximation allows tractable optimization under an epsilon-Wasserstein ball around reference dynamics.
The approach does not require learning a full dynamics model; it relies on a simulator or solver with parameterisable dynamics, which the zero-order updates can treat as a black box.
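
To make the idea of a gradient-free Hessian estimate concrete, the sketch below recovers a Hessian from function evaluations alone. It is not the estimator from the paper: it swaps in plain central finite differences, and the quadratic `toy_distance` is an assumed stand-in for the black-box Wasserstein term rather than a real simulator. The names `finite_difference_hessian` and `toy_distance` are hypothetical.

```python
import numpy as np

def finite_difference_hessian(f, x, h=1e-3):
    """Estimate the Hessian of a black-box scalar function f at x from evaluations of f only."""
    n = x.size
    hess = np.zeros((n, n))
    eye = np.eye(n)
    for i in range(n):
        for j in range(i, n):
            ei, ej = h * eye[i], h * eye[j]
            # Central second difference: four evaluations of f per entry.
            val = (f(x + ei + ej) - f(x + ei - ej)
                   - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
            hess[i, j] = hess[j, i] = val   # fill both halves; the Hessian is symmetric
    return hess

# Assumed toy stand-in for the distance between perturbed and reference dynamics.
A = np.array([[2.0, 0.3], [0.3, 0.5]])
phi0 = np.array([1.0, 2.0])

def toy_distance(phi):
    d = phi - phi0
    return 0.5 * d @ A @ d

H_est = finite_difference_hessian(toy_distance, phi0)
print(H_est)
```

Finite differences need on the order of n^2 evaluations for n dynamics parameters, so a sampling-based estimator is the more natural choice when n is large or each evaluation requires simulator rollouts.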

This review was created by AI and reviewed by human editors.