QUICK REVIEW

[Paper Digest] Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning

Anusha Nagabandi, Ignasi Clavera|arXiv (Cornell University)|Mar 30, 2018

Robotic Locomotion and Control|Cited by 300

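One-Sentence Summary

This paper proposes GrBAL and ReBAL, meta-learning-based online adaptation methods for model-based reinforcement learning. They enable fast, sample-efficient adaptation of the dynamics model to dynamic, real-world environments, including on a real legged millirobot.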

ABSTRACT

Although reinforcement learning methods can achieve impressive results in simulation, the real world presents two major challenges: generating samples is exceedingly expensive, and unexpected perturbations or unseen situations cause proficient but specialized policies to fail at test time. Given that it is impractical to train separate policies to accommodate all situations the agent may see in the real world, this work proposes to learn how to quickly and effectively adapt online to new tasks. To enable sample-efficient learning, we consider learning online adaptation in the context of model-based reinforcement learning. Our approach uses meta-learning to train a dynamics model prior such that, when combined with recent data, this prior can be rapidly adapted to the local context. Our experiments demonstrate online adaptation for continuous control tasks on both simulated and real-world agents. We first show simulated agents adapting their behavior online to novel terrains, crippled body parts, and highly-dynamic environments. We also illustrate the importance of incorporating online adaptation into autonomous agents that operate in the real world by applying our method to a real dynamic legged millirobot. We demonstrate the agent's learned ability to quickly adapt online to a missing leg, adjust to novel terrains and slopes, account for miscalibration or errors in pose estimation, and compensate for pulling payloads.

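Research Motivation and Objectives

Motivation: real-world reinforcement learning needs fast online adaptation when perturbations or new terrain change the dynamics.
Develop a sample-efficient meta-learning framework that uses recent experience to adapt the dynamics model online.
Propose two instantiations, GrBAL (gradient-based) and ReBAL (recurrence-based), for online adaptation of neural network dynamics models.
Evaluate on simulated continuous control tasks with dynamic perturbations and on a real legged millirobot to demonstrate practicality.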

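Proposed Method

Uses model-based reinforcement learning with a neural network dynamics model that meta-learning makes quick to adapt.
Meta-training optimizes both the base model parameters and the update mechanism, so that recent experience enables fast adaptation.
Two update mechanisms: GrBAL uses a MAML-style gradient-based update; ReBAL uses a recurrent network that learns its own update rule.
The model is adapted on the past M time steps and scored on how well it predicts the next K time steps; parameters are trained to minimize the negative log-likelihood of those predictions.
The adapted model is used for planning with MPPI (model predictive path integral control), replanning at every time step.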
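Both training and testing involve online adaptation: during meta-training, rollouts are collected by planning with the adapted model, and those rollouts supply the data for further meta-training.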
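
The adapt-then-plan loop described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the one-parameter model `next_state = s + gain * a`, the hand-picked prior `prior_gain` (standing in for a meta-learned prior), the fixed step size `lr`, the quadratic goal cost, and all function names are assumptions made for this example. With fixed Gaussian noise variance, the negative log-likelihood reduces to mean squared error up to constants, which is what `mse` computes. The planner is a simplified MPPI-style update (Gaussian perturbations of a zero nominal sequence, exponentially cost-weighted), without the control-cost term of full MPPI.

```python
import math
import random

def predict(gain, s, a):
    """Toy 1-D dynamics model (assumption): next_state = s + gain * a."""
    return s + gain * a

def mse(gain, transitions):
    """Mean squared one-step error; Gaussian NLL up to constants for fixed variance."""
    return sum((predict(gain, s, a) - s2) ** 2 for s, a, s2 in transitions) / len(transitions)

def adapt(gain, recent, lr=0.5):
    """One gradient step on the recent transitions (a MAML-style inner update)."""
    grad = sum(2.0 * (predict(gain, s, a) - s2) * a for s, a, s2 in recent) / len(recent)
    return gain - lr * grad

def mppi_action(gain, s0, goal, horizon=5, n_samples=256, sigma=0.5, lam=1.0, seed=0):
    """Simplified MPPI-style planner: perturb a zero nominal action sequence,
    roll each sample out through the model, and return the first action after
    an exponentially cost-weighted update."""
    rng = random.Random(seed)
    nominal = [0.0] * horizon
    noises, costs = [], []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, sigma) for _ in range(horizon)]
        s, cost = s0, 0.0
        for u, e in zip(nominal, eps):
            s = predict(gain, s, u + e)
            cost += (s - goal) ** 2
        noises.append(eps)
        costs.append(cost)
    best = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    return nominal[0] + sum(w * eps[0] for w, eps in zip(weights, noises)) / total

# Simulate a dynamics change: the true actuator gain is half of what the prior expects.
rng = random.Random(0)
true_gain, prior_gain = 0.5, 1.0
M, K = 16, 8  # adapt on the past M steps, evaluate on the next K
s, data = 0.0, []
for _ in range(M + K):
    a = rng.uniform(-1.0, 1.0)
    s_next = s + true_gain * a
    data.append((s, a, s_next))
    s = s_next
past, future = data[:M], data[M:]

adapted_gain = adapt(prior_gain, past)
pre_error = mse(prior_gain, future)     # prediction error before the update
post_error = mse(adapted_gain, future)  # prediction error after the update
action = mppi_action(adapted_gain, s0=0.0, goal=1.0)
print(pre_error, post_error, action)
```

In this toy setting a single inner step moves the gain toward the true value, so the error on the K future steps falls. In the paper's setting the prior and the update rule are themselves trained so that this post-update error is small; here both are fixed by hand.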

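Experimental Results

Research Questions

RQ1: After online adaptation, does the dynamics model change in a way that improves its predictions of near-future dynamics?
RQ2: Can GrBAL and ReBAL adapt online quickly to drastic dynamics changes and unseen environments?
RQ3: How does model-based meta-RL compare with model-free meta-RL and baseline MB methods in sample efficiency and performance?
RQ4: Across diverse tasks, which of GrBAL and ReBAL offers better generalization and faster adaptation?
RQ5: Is online adaptation feasible and beneficial on a real robot?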

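Key Findings

Prediction error drops from pre-update to post-update, indicating that online adaptation is effective.
Meta-trained on the equivalent of 1.5 to 3 hours of real-world data, GrBAL and ReBAL match or outperform model-free agents trained on roughly 1000× more data.
In several task scenarios that demand fast adaptation, GrBAL outperforms MB+DE and the MB oracle.
In real-robot experiments on the legged millirobot, GrBAL adapts online to terrain changes, calibration errors, and payloads.
Overall, GrBAL outperforms ReBAL in fast adaptation and generalization across the test environments.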

This digest was generated by AI and reviewed by a human editor.