[Paper Review] Learning to Walk via Deep Reinforcement Learning
The paper presents a sample-efficient, entropy-regularized deep RL method that learns quadrupedal walking directly on hardware with minimal hyperparameter tuning, demonstrated on a real Minitaur robot and validated on simulated benchmarks.
Deep reinforcement learning (deep RL) holds the promise of automating the acquisition of complex controllers that can map sensory inputs directly to low-level actions. In the domain of robotic locomotion, deep RL could enable learning locomotion skills with minimal engineering and without an explicit model of the robot dynamics. Unfortunately, applying deep RL to real-world robotic tasks is exceptionally difficult, primarily due to poor sample complexity and sensitivity to hyperparameters. While hyperparameters can be easily tuned in simulated domains, tuning may be prohibitively expensive on physical systems, such as legged robots, that can be damaged through extensive trial-and-error learning. In this paper, we propose a sample-efficient deep RL algorithm based on maximum entropy RL that requires minimal per-task tuning and only a modest number of trials to learn neural network policies. We apply this method to learning walking gaits on a real-world Minitaur robot. Our method can acquire a stable gait from scratch directly in the real world in about two hours, without relying on any model or simulation, and the resulting policy is robust to moderate variations in the environment. We further show that our algorithm achieves state-of-the-art performance on simulated benchmarks with a single set of hyperparameters. Videos of training and the learned policy can be found on the project website.
Motivation & Objective
- Motivate end-to-end locomotion learning without explicit dynamics models or gait design.
- Develop a sample-efficient RL algorithm robust to hyperparameters for real-world robots.
- Enable automatic entropy (temperature) tuning to reduce per-task hyperparameter tuning.
- Demonstrate learning of stable locomotion gaits directly on a physical quadruped and assess robustness.
Proposed method
- Extend maximum entropy RL with an entropy-constrained objective to avoid manual tuning of the temperature parameter.
- Use dual gradient updates to automatically adjust the temperature to meet a target entropy.
- Adopt a soft actor-critic framework with two Q-functions and a stochastic Gaussian policy.
- Train asynchronously on real hardware, with data collection on the robot, reward computation from motion capture, and network training running as separate parts of the pipeline.
- Evaluate on OpenAI Gym benchmarks and on the Minitaur robot in real and simulated settings.
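The temperature adjustment and the two-Q-function target described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the commonly used SAC form of the temperature loss, J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)], a log-parameterized temperature, and a soft target that takes the minimum of the two Q estimates. The function names, the learning rate, and the numeric values in the usage note are hypothetical.

```python
import math


def update_temperature(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient step on log(alpha) for the assumed temperature loss.

    J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]
    With alpha = exp(log_alpha):
        dJ/d(log_alpha) = -alpha * (mean_log_prob + target_entropy)
    """
    mean_log_prob = sum(log_probs) / len(log_probs)
    alpha = math.exp(log_alpha)
    grad = -alpha * (mean_log_prob + target_entropy)
    # Gradient descent: if policy entropy (-mean_log_prob) is below the
    # target, grad is negative and the temperature increases.
    return log_alpha - lr * grad


def soft_q_target(reward, q1_next, q2_next, next_log_prob, alpha,
                  gamma=0.99, done=False):
    """Soft Bellman target using the minimum of two Q estimates (assumed form)."""
    soft_value = min(q1_next, q2_next) - alpha * next_log_prob
    return reward + (0.0 if done else gamma * soft_value)
```

For example, calling `update_temperature(0.0, [2.0, 4.0], target_entropy=-1.0)` raises `log_alpha`, because the sampled log-probabilities imply an entropy below the target; a batch with very negative log-probabilities lowers it instead. Parameterizing the temperature through its logarithm is a design choice in this sketch that keeps alpha positive without an explicit projection step.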
Experimental results
Research questions
- RQ1: Can entropy-constrained maximum entropy RL learn locomotion directly on real robots with minimal hyperparameter tuning?
- RQ2: Does the learned policy generalize to unseen terrains and perturbations in the real world?
- RQ3: How does the method perform in simulation benchmarks compared to baselines, and with fixed vs. adaptive temperature?
- RQ4: What data-efficiency and robustness benefits arise from the proposed entropy adjustment mechanism?
Key findings
- The method achieves stable real-world walking on Minitaur in about two hours (≈400 rollouts).
- Across OpenAI Gym benchmarks, the approach matches or exceeds fixed-temperature SAC while using a single set of hyperparameters for all tasks.
- Automated entropy adjustment reduces sensitivity to reward scale and target entropy, improving robustness across tasks.
- In simulation, the method demonstrates state-of-the-art data efficiency and robustness, including resistance to lateral perturbations up to 220 N.
- The learned gait on Minitaur is periodic and synchronized, with speed comparable to a default trot but different joint trajectories, and it generalizes to unseen obstacles and terrain even though training took place on flat terrain.
This review was created by AI and reviewed by human editors.