[Paper Walkthrough] Iterative Reinforcement Learning Based Design of Dynamic Locomotion Skills for Cassie
Presents an iterative design method built on DASS tuples that combines supervised imitation with policy-gradient RL to produce robust, variable-speed walking policies for Cassie, which transfer from simulation to hardware without dynamics randomization.
Deep reinforcement learning (DRL) is a promising approach for developing legged locomotion skills. However, the iterative design process that is inevitable in practice is poorly supported by the default methodology. It is difficult to predict the outcomes of changes made to the reward functions, policy architectures, and the set of tasks being trained on. In this paper, we propose a practical method that allows the reward function to be fully redefined on each successive design iteration while limiting the deviation from the previous iteration. We characterize policies via sets of Deterministic Action Stochastic State (DASS) tuples, which represent the deterministic policy state-action pairs as sampled from the states visited by the trained stochastic policy. New policies are trained using a policy gradient algorithm which then mixes RL-based policy gradients with gradient updates defined by the DASS tuples. The tuples also allow for robust policy distillation to new network architectures. We demonstrate the effectiveness of this iterative-design approach on the bipedal robot Cassie, achieving stable walking with different gait styles at various speeds. We demonstrate the successful transfer of policies learned in simulation to the physical robot without any dynamics randomization, and that variable-speed walking policies for the physical robot can be represented by a small dataset of 5-10k tuples.
Research Motivation and Goals
- Enable an iterative DRL design process that allows full reward-function redefinition at each iteration while limiting deviation from the prior policy.
- Introduce Deterministic Action Stochastic State (DASS) tuples to reconstruct and distill policies from few samples.
- Combine imitation learning from DASS with RL via a soft constraint to refine policies toward new objectives while staying close to expert behavior.
- Demonstrate transfer of simulation-trained policies to the physical Cassie robot without dynamics randomization.
- Show policy compression and distillation to smaller networks without sacrificing robustness.
Proposed Method
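- Define DASS as a dataset of (state, expert mean action) pairs: the states are those visited while running the expert's stochastic policy, and each state is paired with the expert's deterministic mean action at that state.
- Minimize the supervised loss J_sp(θ)=E_{s∼D}[(m_θ(s)−m_e(s))^2] over the DASS dataset D, where m_θ and m_e denote the mean actions of the learned and expert policies, to recover a policy from limited samples.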
- Formulate total objective J_total = J_rl − w J_sp to softly constrain RL updates with imitation data.
- Update θ using θ_{t+1} = θ_t + α(∇_θ J_rl − w ∇_θ J_sp) to blend policy gradients with supervised learning.
- Use large, fixed-covariance Gaussian policies to inject noise during training for robustness and distillation ease.
- Demonstrate policy training on a high-fidelity Cassie simulator (MuJoCo) with Proximal Policy Optimization, then transfer to the physical robot without dynamics randomization.
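The DASS collection step and the mixed update rule listed above can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: a linear map stands in for the policy network, `toy_step` is a hypothetical toy dynamics function (not a Cassie model), and the names `collect_dass`, `grad_j_sp`, and `mixed_update` are invented for this example. The RL gradient is passed in as an argument; in the paper it would come from a policy-gradient method such as PPO.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes (assumed). toy_step below requires STATE_DIM == 2 * ACTION_DIM.
STATE_DIM, ACTION_DIM, SIGMA = 4, 2, 0.1  # SIGMA: fixed std of the Gaussian policy

def mean_action(theta, s):
    """Policy mean m_theta(s); a linear map stands in for the network."""
    return theta @ s

def toy_step(s, a):
    """Hypothetical stable toy dynamics, used only to generate visited states."""
    drive = 0.1 * np.tanh(np.concatenate([a, a]))
    return 0.8 * s + drive + 0.05 * rng.standard_normal(STATE_DIM)

def collect_dass(theta_e, n_tuples, horizon=50):
    """Roll out the stochastic expert, but record its deterministic mean action."""
    states, actions = [], []
    while len(states) < n_tuples:
        s = rng.standard_normal(STATE_DIM)
        for _ in range(horizon):
            m = mean_action(theta_e, s)
            states.append(s.copy())
            actions.append(m)                                # DASS tuple (s, m_e(s))
            a = m + SIGMA * rng.standard_normal(ACTION_DIM)  # noisy action drives the state
            s = toy_step(s, a)
    return np.array(states[:n_tuples]), np.array(actions[:n_tuples])

def j_sp(theta, S, A):
    """Supervised loss: mean squared error between learner and expert mean actions."""
    return np.mean(np.sum((S @ theta.T - A) ** 2, axis=1))

def grad_j_sp(theta, S, A):
    """Gradient of j_sp with respect to theta for the linear policy mean."""
    err = S @ theta.T - A              # shape (N, ACTION_DIM)
    return 2.0 * err.T @ S / len(S)    # shape (ACTION_DIM, STATE_DIM)

def mixed_update(theta, grad_j_rl, S, A, alpha=0.2, w=1.0):
    """One step of theta <- theta + alpha * (grad J_rl - w * grad J_sp)."""
    return theta + alpha * (grad_j_rl - w * grad_j_sp(theta, S, A))

# Usage: with a zero RL gradient the update reduces to pure imitation of the expert.
theta_e = rng.standard_normal((ACTION_DIM, STATE_DIM))  # stand-in "expert"
S, A = collect_dass(theta_e, 500)
theta = np.zeros_like(theta_e)
no_rl = np.zeros_like(theta)           # placeholder for a real policy-gradient estimate
loss_before = j_sp(theta, S, A)
for _ in range(500):
    theta = mixed_update(theta, no_rl, S, A)
loss_after = j_sp(theta, S, A)
print(loss_before, loss_after)
```

In this sketch the weight `w` plays the role of the soft constraint: a larger `w` keeps the new policy closer to the expert's mean actions on the DASS states, while a smaller `w` lets a nonzero RL gradient move the policy further toward the new reward.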
Experimental Results
Research Questions
- RQ1: Can an iterative RL design framework with DASS data collection support redefinition of reward functions across design iterations while bounding deviation from prior policies?
- RQ2: How effectively can DASS-based imitation be combined with policy-gradient RL to produce robust, variable-speed locomotion policies?
- RQ3: Does transferring policies learned in simulation to Cassie without dynamics randomization yield stable walking across multiple gaits and speeds?
- RQ4: What is the role of policy compression and distillation in preserving robustness when moving to smaller networks?
- RQ5: Can multiple specialized policies be distilled into a single policy capable of multiple locomotion styles?
Key Findings
- Demonstrated stable walking with different gait styles and speeds on Cassie using policies learned in simulation and transferred to hardware without dynamics randomization.
- Small DASS datasets (5–10k tuples) suffice to reconstruct robust variable-speed walking policies on hardware.
- Combining RL with DASS-based imitation enables exploring new reward functions while staying close to prior policies, avoiding forgetting.
- Larger neural networks accelerate RL learning and yield more robust policies; once distilled into smaller networks (e.g., hidden layers of 16×16 to 64×64), the policies perform comparably on hardware.
- Iterative design with changing rewards can produce smoother pelvis motion and stable stepping across speeds, including forward and backward walking, and can handle unmodeled disturbances on hardware.
This walkthrough was generated by AI and reviewed by a human editor.