QUICK REVIEW

[Paper Review] Learning to Walk via Deep Reinforcement Learning

Tuomas Haarnoja, Sehoon Ha|arXiv (Cornell University)|Dec 26, 2018

Robotic Locomotion and Control | Cited by 42

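One-Sentence Summary

The paper proposes a sample-efficient, entropy-regularized deep reinforcement learning method that learns walking gaits directly on hardware with minimal hyperparameter tuning; it is demonstrated on a Minitaur robot and validated on simulated benchmarks.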

ABSTRACT

Deep reinforcement learning (deep RL) holds the promise of automating the acquisition of complex controllers that can map sensory inputs directly to low-level actions. In the domain of robotic locomotion, deep RL could enable learning locomotion skills with minimal engineering and without an explicit model of the robot dynamics. Unfortunately, applying deep RL to real-world robotic tasks is exceptionally difficult, primarily due to poor sample complexity and sensitivity to hyperparameters. While hyperparameters can be easily tuned in simulated domains, tuning may be prohibitively expensive on physical systems, such as legged robots, that can be damaged through extensive trial-and-error learning. In this paper, we propose a sample-efficient deep RL algorithm based on maximum entropy RL that requires minimal per-task tuning and only a modest number of trials to learn neural network policies. We apply this method to learning walking gaits on a real-world Minitaur robot. Our method can acquire a stable gait from scratch directly in the real world in about two hours, without relying on any model or simulation, and the resulting policy is robust to moderate variations in the environment. We further show that our algorithm achieves state-of-the-art performance on simulated benchmarks with a single set of hyperparameters. Videos of training and the learned policy can be found on the project website.

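Motivation and Goals

Advance end-to-end locomotion learning without an explicit dynamics model or hand-designed gaits.
Develop a sample-efficient RL algorithm that is robust to hyperparameter choices and suitable for real-world robots.
Adjust the entropy (temperature) automatically to reduce per-task hyperparameter tuning.
Show that a stable walking gait can be learned directly on a real quadruped robot, and evaluate its robustness.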

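Proposed Method

Extends maximum entropy RL with an entropy-constrained objective, removing the need to tune the temperature parameter by hand.
Adjusts the temperature automatically with dual gradient updates so that the policy entropy reaches a target value.
Builds on the soft actor-critic framework, with two Q-functions and a stochastic Gaussian policy.
Trains asynchronously on real hardware, with data collection, motion-capture-based rewards, and a separate training process.
Evaluates on OpenAI Gym benchmarks and on the Minitaur robot, both in the real world and in simulation.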
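
The dual gradient update of the temperature can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes the temperature objective commonly used with soft actor-critic, J(alpha) = E[-alpha * (log pi(a|s) + H_target)], and takes one gradient-descent step on it. The function names (alpha_gradient, update_log_alpha), the learning rate, and the choice to optimize log(alpha) so that alpha stays positive are all assumptions made for illustration.

```python
import math


def alpha_gradient(log_probs, target_entropy):
    """dJ/d(alpha) for J(alpha) = mean(-alpha * (log_pi + target_entropy)).

    log_probs: log pi(a|s) for a batch of actions sampled from the policy.
    """
    return -sum(lp + target_entropy for lp in log_probs) / len(log_probs)


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J with respect to log(alpha).

    Optimizing log(alpha) instead of alpha keeps the temperature positive
    (an assumed parameterization, not taken from the paper).
    """
    alpha = math.exp(log_alpha)
    # Chain rule: dJ/d(log_alpha) = alpha * dJ/d(alpha).
    grad = alpha * alpha_gradient(log_probs, target_entropy)
    return log_alpha - lr * grad


if __name__ == "__main__":
    target = 1.0
    # Sampled entropy estimate is 0.5, below the target: temperature goes up.
    low_entropy_batch = [-0.5, -0.5, -0.5, -0.5]
    # Sampled entropy estimate is 2.0, above the target: temperature goes down.
    high_entropy_batch = [-2.0, -2.0, -2.0, -2.0]
    print(update_log_alpha(0.0, low_entropy_batch, target))
    print(update_log_alpha(0.0, high_entropy_batch, target))
```

When the sampled policy entropy falls below the target, the step raises the temperature and so strengthens the entropy bonus; when the entropy exceeds the target, the step lowers it. This feedback loop is what stands in for hand-tuning a fixed temperature per task.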

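Experimental Results

Research Questions

RQ1: Can entropy-constrained maximum entropy RL learn locomotion directly on a real robot with almost no hyperparameter tuning?
RQ2: Do the learned policies generalize to unseen terrains and perturbations in the real world?
RQ3: How does the method compare with baselines on simulated benchmarks, and how does a fixed temperature compare with an adaptive one?
RQ4: What data-efficiency and robustness benefits does the proposed entropy adjustment mechanism provide?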

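Key Findings

The method achieves stable real-world walking on the Minitaur in about two hours (roughly 400 rollouts).
On OpenAI Gym benchmarks, the method, using a single set of hyperparameters, matches or exceeds SAC with a fixed temperature.
Automatic entropy adjustment reduces sensitivity to the reward scale and the target entropy, improving robustness across tasks.
In simulation, the method shows state-of-the-art data efficiency and robustness, including resistance to lateral perturbations of up to 220 N.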
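The gait learned on the Minitaur is periodic and synchronized, with a speed comparable to the default trot but different joint trajectories, and it generalizes to unseen obstacles and terrains even though training took place on flat terrain.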

This review was generated by AI and reviewed by a human editor.