QUICK REVIEW

[Paper Review] Advanced Skills by Learning Locomotion and Local Navigation End-to-End

Nikita Rudin, David Hoeller|arXiv (Cornell University)|Jan 1, 2022

Robotic Locomotion and Control | Cited by 2

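One-Sentence Summary

This paper proposes an end-to-end deep reinforcement learning method for legged robots that trains a single policy to learn locomotion and local navigation together, directly optimizing for reaching a target within a time limit rather than tracking velocity commands. The method lets a real quadrupedal robot perform more agile, more energy-efficient, and more natural behaviors, such as dynamic jumping and climbing, and it achieves a higher success rate on challenging terrain than a velocity-tracking baseline.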

ABSTRACT

The common approach for local navigation on challenging environments with legged robots requires path planning, path following and locomotion, which usually requires a locomotion control policy that accurately tracks a commanded velocity. However, by breaking down the navigation problem into these sub-tasks, we limit the robot's capabilities since the individual tasks do not consider the full solution space. In this work, we propose to solve the complete problem by training an end-to-end policy with deep reinforcement learning. Instead of continuously tracking a precomputed path, the robot needs to reach a target position within a provided time. The task's success is only evaluated at the end of an episode, meaning that the policy does not need to reach the target as fast as possible. It is free to select its path and the locomotion gait. Training a policy in this way opens up a larger set of possible solutions, which allows the robot to learn more complex behaviors. We compare our approach to velocity tracking and additionally show that the time dependence of the task reward is critical to successfully learn these new behaviors. Finally, we demonstrate the successful deployment of policies on a real quadrupedal robot. The robot is able to cross challenging terrains, which were not possible previously, while using a more energy-efficient gait and achieving a higher success rate.

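Motivation and Goals

Overcome the limitations of conventional navigation pipelines, which split locomotion and navigation into separate tasks and impose rigid constraints.
Remove the velocity-tracking constraint so that legged robots can learn complex dynamic behaviors such as jumping, climbing, and adaptive gait selection.
Improve energy efficiency and success rate on challenging terrain by training a unified policy that explores the full solution space.
Deploy the method on a real quadrupedal robot (ANYmal) and show that it generalizes across diverse, difficult environments.
Show that time-dependent shaping of the final reward signal is critical for learning complex behaviors.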

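Proposed Method

Train a single deep reinforcement learning policy end-to-end that maps state observations to action commands, with the goal of reaching a target position within a time limit.
Define the task reward only at the end of each episode, based on the final distance to the target and the elapsed time, rather than on continuous velocity tracking.
Combine dense, sparse, and shaped rewards that penalize distance and time, and add a time-dependent shaping component to guide learning.
Use a curriculum schedule that gradually increases target distance and terrain complexity to stabilize training.
Deploy the policy on a real ANYmal robot, using a learned actuator model to simulate the series elastic actuators and clipping torques to their physical limits.
During deployment, control the robot through a joystick or position targets; the policy responds naturally to changing commands without being fine-tuned for such inputs.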
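
The end-of-episode task reward described above can be illustrated with a minimal sketch. The functional form below (zero reward until a final window, then a bounded term that grows as the robot gets closer to the target) is an assumption chosen for illustration, and the names `task_reward`, `episode_length`, and `reward_window` are hypothetical; this review does not give the exact formula.

```python
def task_reward(t, episode_length, reward_window, distance_to_target):
    """Illustrative time-dependent task reward (assumed form, not from the review).

    The reward is zero for most of the episode, so the policy is free to
    choose its path and gait. Only during the last `reward_window` seconds
    does closeness to the target pay off.
    """
    if reward_window <= 0 or reward_window > episode_length:
        raise ValueError("reward_window must be in (0, episode_length]")
    if t <= episode_length - reward_window:
        return 0.0  # no task reward before the final window
    # Bounded in (0, 1 / reward_window]; largest when the distance is zero.
    return (1.0 / reward_window) / (1.0 + distance_to_target ** 2)


if __name__ == "__main__":
    # Early in a 6 s episode: no reward, regardless of distance.
    print(task_reward(1.0, 6.0, 1.0, 3.0))
    # Inside the final 1 s window, standing on the target.
    print(task_reward(5.5, 6.0, 1.0, 0.0))
```

Because nothing is rewarded before the final window, a policy trained this way is not pushed to move as fast as possible, which matches the abstract's statement that success is evaluated only at the end of an episode.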

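Experimental Results

Research Questions

RQ1: Compared with conventional velocity tracking, does end-to-end training of a single policy for navigation and locomotion produce more agile and more adaptive behaviors?
RQ2: How does time-dependent shaping of the final reward affect the emergence of complex behaviors in legged robots?
RQ3: Do policies trained in simulation generalize to real-world tasks that involve dynamic maneuvers such as jumping and climbing?
RQ4: Once the velocity-tracking constraint is removed, can the policy discover more energy-efficient gaits and raise the success rate on difficult terrain?
RQ5: Why does the policy learn to walk in only one direction, and how can this directional bias be mitigated?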

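Key Findings

Compared with the velocity-tracking baseline, the end-to-end policy achieves a higher success rate on challenging terrain such as stairs, 0.55 m high boxes, and 0.6 m wide gaps.
The robot performs dynamic maneuvers such as leaping over gaps and climbing stairs at high speed, behaviors that earlier velocity-tracking methods could not achieve.
The policy learns more energy-efficient gaits that depart from the standard trot commonly used with velocity tracking, and its motion looks more natural and organic.
Time-dependent reward shaping is critical to successful training; without it, the policy fails to learn complex behaviors.
Although it generalizes successfully, the policy still shows a directional bias: because of local minima in the loss landscape, it learns to walk in only one direction.
The method is deployed successfully on real hardware with the help of a learned actuator model and torque clipping, but perception and state estimation remain the main limiting factors in complex tasks.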
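
The torque clipping mentioned in the method and findings can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the function name `clip_torques` and the 80 N·m limit in the usage example are hypothetical values chosen only to show the mechanism.

```python
def clip_torques(torques, limit):
    """Clamp each commanded joint torque to the symmetric range [-limit, limit].

    `limit` stands in for an actuator's physical torque limit; the actual
    limits used in the paper are not stated in this review.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return [max(-limit, min(limit, tau)) for tau in torques]


if __name__ == "__main__":
    # Hypothetical 80 N·m limit: out-of-range commands are saturated,
    # in-range commands pass through unchanged.
    print(clip_torques([100.0, -100.0, 10.0], 80.0))
```

Applying the same saturation in simulation and on hardware keeps the policy from relying on torques the real actuators cannot deliver.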

This review was generated by AI and reviewed by a human editor.