QUICK REVIEW

[Paper Review] Primal Wasserstein Imitation Learning

Robert Dadashi, Léonard Hussenot | arXiv (Cornell University) | Jun 8, 2020

Reinforcement Learning in Robotics | 61 references | 41 citations

TL;DR

PWIL minimizes the Wasserstein distance between expert and agent state-action distributions using a primal formulation, deriving an offline reward to achieve near-expert imitation with few demonstrations and without minmax training.

ABSTRACT

Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which requires little fine-tuning. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample efficient manner in terms of agent interactions and of expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert with the Wasserstein distance, rather than the commonly used proxy of performance.

Motivation & Objective

Motivate imitation learning when reward signals are hard to specify or sparse.
Propose a principled distance-based objective using the primal Wasserstein distance between state-action distributions.
Derive an offline reward function from an upper bound of the primal Wasserstein distance to guide learning.
Demonstrate sample-efficient recovery of expert behavior on continuous control tasks, including challenging Humanoid scenarios.
Show applicability to both feature-based and pixel-based (visual) observations.
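
For reference, the primal form of the Wasserstein distance mentioned above can be written for two empirical state-action distributions as below. This is the standard optimal-transport definition, not a transcription of the paper's equations; the symbols T (number of agent transitions), D (number of expert transitions), and θ (the coupling matrix) are notation assumed here.

```latex
W_1(\hat{\rho}_\pi, \hat{\rho}_e)
  = \min_{\theta \in \Theta} \sum_{i=1}^{T} \sum_{j=1}^{D}
    d\big((s_i^\pi, a_i^\pi), (s_j^e, a_j^e)\big)\, \theta[i, j],
\qquad
\Theta = \Big\{ \theta \in \mathbb{R}_{+}^{T \times D} :
    \sum_{j=1}^{D} \theta[i, j] = \tfrac{1}{T},\;
    \sum_{i=1}^{T} \theta[i, j] = \tfrac{1}{D} \Big\}
```

Because the distance is a minimum over all couplings in Θ, evaluating the sum for any single feasible coupling gives an upper bound on it, which is why a reward can be built from one fixed coupling.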

Proposed method

Formulate imitation as minimizing the 1-Wasserstein distance between empirical state-action distributions of the agent and expert.
Introduce a greedy coupling to obtain an online computable upper bound on the Wasserstein distance.
Define an episodic, history-dependent reward r as a monotonically decreasing function of the greedy cost c_i, which is computed from distances to expert transitions.
Provide an algorithm PWIL that uses the offline-derived reward with a generic RL agent.
Use a standardized Euclidean distance on concatenated state-action vectors to define the distance d(.); for pixel observations, the distance can instead be learned.
Demonstrate scalability and stability by avoiding a minmax training loop typical of adversarial IL methods.
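
The greedy coupling and the reward built on it can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `GreedyCouplingRewarder`, the exponential reward shape, the scaling by episode length, and the default values of `alpha` and `beta` are all hypothetical choices, and a plain Euclidean distance stands in for the standardized one.

```python
import math


def euclidean(x, y):
    """Plain Euclidean distance (assumption: inputs are already standardized)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


class GreedyCouplingRewarder:
    """Hypothetical helper: turns a greedy coupling cost into a per-step reward."""

    def __init__(self, expert_atoms, episode_length,
                 alpha=5.0, beta=5.0, distance=euclidean):
        # expert_atoms: concatenated state-action vectors from the demonstrations.
        self.expert_atoms = [tuple(x) for x in expert_atoms]
        self.episode_length = episode_length
        self.alpha = alpha    # reward scale (illustrative value)
        self.beta = beta      # reward temperature (illustrative value)
        self.distance = distance
        self.reset()

    def reset(self):
        """Call at the start of every episode: each expert atom gets weight 1/D."""
        n = len(self.expert_atoms)
        self.weights = [1.0 / n] * n

    def cost(self, agent_atom):
        """Greedily move the agent atom's mass 1/T onto the nearest expert atoms."""
        remaining = 1.0 / self.episode_length
        cost = 0.0
        eps = 1e-12
        while remaining > eps:
            candidates = [j for j, w in enumerate(self.weights) if w > eps]
            if not candidates:
                break  # all expert mass has been consumed
            j = min(candidates,
                    key=lambda k: self.distance(agent_atom, self.expert_atoms[k]))
            moved = min(remaining, self.weights[j])
            cost += moved * self.distance(agent_atom, self.expert_atoms[j])
            remaining -= moved
            self.weights[j] -= moved  # weight 0 means the atom is "popped out"
        return cost

    def reward(self, agent_atom):
        """Decreasing function of the greedy cost (exponential form is assumed)."""
        c = self.cost(agent_atom)
        return self.alpha * math.exp(-self.beta * self.episode_length * c)
```

With two expert atoms `(0.0, 0.0)` and `(1.0, 0.0)` and `episode_length=2`, calling `reward((0.0, 0.0))` twice in a row returns the full `alpha` the first time and a lower value the second time, because the nearest expert atom has already been consumed. The reward therefore depends on the history of the episode, and `reset()` must be called between episodes.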

Experimental results

Research questions

RQ1: Does PWIL recover expert behavior across MuJoCo locomotion tasks with varying numbers of demonstrations?
RQ2: How sample-efficient is PWIL relative to state-of-the-art IL methods like DAC and BC?
RQ3: Does PWIL actually minimize the Wasserstein distance between expert and agent state-action distributions?
RQ4: Can PWIL extend to visual/pixel-based observations where the MDP metric is learned offline?
RQ5: What is the impact of ablations on PWIL’s ability to reproduce expert behavior?
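
Answering RQ3 requires measuring the Wasserstein distance itself rather than the greedy upper bound used to build the reward. The sketch below contrasts the two on a toy example. It is an illustration only, not the paper's evaluation code: the function names are hypothetical, both sets are assumed to have the same size with uniform weights (so an optimal coupling is a permutation), and the brute-force search is only feasible for a handful of points. A real evaluation would use a linear-assignment or optimal-transport solver.

```python
import itertools
import math


def dist(x, y):
    """Euclidean distance between two state-action vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def exact_w1_equal_size(agent, expert):
    """Exact W1 for two uniform empirical distributions of equal size n.

    Brute force over all n! one-to-one matchings; toy sizes only.
    """
    n = len(agent)
    if n != len(expert):
        raise ValueError("this sketch assumes equal-size sets")
    best = min(
        sum(dist(agent[i], expert[p[i]]) for i in range(n))
        for p in itertools.permutations(range(n))
    )
    return best / n


def greedy_w1_upper_bound(agent, expert):
    """Match each agent atom, in visiting order, to the nearest unused expert atom."""
    if len(agent) != len(expert):
        raise ValueError("this sketch assumes equal-size sets")
    unused = list(range(len(expert)))
    total = 0.0
    for x in agent:
        j = min(unused, key=lambda k: dist(x, expert[k]))
        total += dist(x, expert[j])
        unused.remove(j)
    return total / len(agent)


if __name__ == "__main__":
    agent = [(2.0,), (3.0,)]
    expert = [(0.0,), (3.0,)]
    print(exact_w1_equal_size(agent, expert))    # optimal matching
    print(greedy_w1_upper_bound(agent, expert))  # greedy matching, never smaller
```

In this example the greedy matching assigns the first agent point to the expert point at 3.0, which forces the second agent point onto the distant one, so the greedy value exceeds the exact distance. The gap shows why the true distance has to be evaluated separately.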

Key findings

PWIL achieves near-expert performance on multiple MuJoCo tasks, including Humanoid, even from a single demonstration.
Compared with DAC, PWIL reaches a comparable or smaller Wasserstein distance to the expert across environments, with a smaller distance in most cases.
The offline-derived reward function requires only two hyperparameters and benefits from a simple reward formulation, reducing tuning effort.
Ablation studies show that matching actions in addition to states (dropped in the PWIL-state variant) and properly weighting the MDP metric crucially affect performance, and that “pop-outs” are important for recovering full expert behavior.
PWIL extends to pixel-based observations by learning a distance in an embedding space (via Temporal Cycle-Consistency Learning) and still recovers task success in the door-opening scenario.
PWIL shows strong sample efficiency, notably reaching meaningful scores on Humanoid from few demonstrations, and is robust across seeds and environments.

This review was created by AI and reviewed by human editors.