QUICK REVIEW

[Paper Review] Preparing for the Unknown: Learning a Universal Policy with Online System Identification

Wenhao Yu, Jie Tan|arXiv (Cornell University)|Feb 8, 2017

Reinforcement Learning in Robotics | References: 27 | Cited by: 43

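One-Sentence Summary

This paper proposes UP-OSI, a control framework that combines a Universal Policy with Online System Identification to make reinforcement learning robust under unknown dynamics. By training a policy on simulated data so that it is sensitive to variations in the dynamics, and by estimating the model parameters in real time, UP-OSI performs well even under dynamics not seen during training. In extrapolation settings it even outperforms a policy that is given the true model parameters.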

ABSTRACT

We present a new method of learning control policies that successfully operate under unknown dynamic models. We create such policies by leveraging a large number of training examples that are generated using a physical simulator. Our system is made of two components: a Universal Policy (UP) and a function for Online System Identification (OSI). We describe our control policy as universal because it is trained over a wide array of dynamic models. These variations in the dynamic model may include differences in mass and inertia of the robots' components, variable friction coefficients, or unknown mass of an object to be manipulated. By training the Universal Policy with this variation, the control policy is prepared for a wider array of possible conditions when executed in an unknown environment. The second part of our system uses the recent state and action history of the system to predict the dynamics model parameters mu. The value of mu from the Online System Identification is then provided as input to the control policy (along with the system state). Together, UP-OSI is a robust control policy that can be used across a wide range of dynamic models, and that is also responsive to sudden changes in the environment. We have evaluated the performance of this system on a variety of tasks, including the problem of cart-pole swing-up, the double inverted pendulum, locomotion of a hopper, and block-throwing of a manipulator. UP-OSI is effective at these tasks across a wide range of dynamic models. Moreover, when tested with dynamic models outside of the training range, UP-OSI outperforms the Universal Policy alone, even when UP is given the actual value of the model dynamics. In addition to the benefits of creating more robust controllers, UP-OSI also holds out promise of narrowing the Reality Gap between simulated and real physical systems.
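The interaction between the two components described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional point-mass dynamics, the constants `DT` and `TARGET_V`, the analytic `osi` estimator, and the hand-written `policy` stand in for the simulator and the learned networks. The sketch only shows the data flow: the recent state and action history produces an estimate of mu, and that estimate is passed to the policy together with the state.

```python
from collections import deque

DT = 0.1        # hypothetical time step
TARGET_V = 1.0  # hypothetical target velocity


def env_step(v, action, true_mass):
    """Toy point-mass dynamics; the true mass is hidden from the controller."""
    return v + (action / true_mass) * DT


def osi(history):
    """Stand-in OSI: estimate the mass from recent (v, action, v_next) tuples."""
    samples = [a * DT / (v_next - v) for v, a, v_next in history if v_next != v]
    return sum(samples) / len(samples) if samples else None


def policy(v, mu_hat):
    """Stand-in universal policy: the action depends on the state and on mu."""
    return mu_hat * 0.5 * (TARGET_V - v) / DT


def run_up_osi(true_mass, mu_init=1.0, history_len=3, steps=20):
    """Roll out the loop: OSI estimates mu, the policy acts on (state, mu)."""
    history = deque(maxlen=history_len)  # recent transitions only
    v, mu_hat, estimates = 0.0, mu_init, []
    for _ in range(steps):
        if len(history) == history_len:
            estimate = osi(history)      # online system identification
            if estimate is not None:
                mu_hat = estimate
        action = policy(v, mu_hat)       # policy conditioned on the estimate
        v_next = env_step(v, action, true_mass)
        history.append((v, action, v_next))
        v = v_next
        estimates.append(mu_hat)
    return v, estimates
```

Calling `run_up_osi(true_mass=2.0)` starts from the initial guess `mu_init` and switches to the estimated value once the history buffer is full. In the paper both components are learned from simulation rather than hand-written as they are here.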

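Research Motivation and Objectives

Bridge the reality gap between simulation and real-world robot control by enabling policies to generalize across unknown dynamic models.
Reduce reliance on costly real-world data collection by training offline with large-scale physics simulation.
Develop a control policy that adapts in practice to changing or unknown system parameters such as mass, friction, or object inertia.
Improve sample efficiency and robustness by decoupling system identification from policy learning and combining supervised and reinforcement learning components.
Generalize to dynamic model parameters outside the training distribution, demonstrating extrapolation ability.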

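Proposed Method

Train a Universal Policy (UP) with deep reinforcement learning over a diverse set of simulated dynamic models, where the policy takes both the state and the dynamic model parameters μ as input.
Implement an Online System Identification (OSI) module that estimates μ in real time from the recent history of states and actions, trained by supervised learning on simulated data.
Integrate UP and OSI into a joint framework (UP-OSI) in which OSI predicts μ at every time step and feeds it to the policy for action selection.
Use a recurrent or sequence model in OSI to process the time series of state-action history, so that the dynamic model can be estimated from motion sequences.
Train OSI over a limited number of iterations (for example, five) to balance accuracy against inference speed and keep the system usable in real time.
Decouple the learning process: UP is trained with reinforcement learning and OSI with supervised learning on simulated trajectories, which improves sample efficiency.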
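The supervised side of the method listed above can be sketched as a regression from simulated transitions to the parameter that generated them. This is a toy illustration under stated assumptions, not the paper's implementation: the dynamics in `simulate_step`, the time step `DT`, the sampling ranges, and the use of a two-weight linear least-squares model in place of a neural network are all hypothetical choices made to keep the example short and dependency-free.

```python
import random

DT = 0.1  # hypothetical simulation time step


def simulate_step(v, a, mu):
    """Toy dynamics: velocity changes by (action - mu) * DT."""
    return v + DT * (a - mu)


def collect_dataset(n, rng):
    """Sample mu, simulate one transition, record ((a, dv), mu) pairs."""
    data = []
    for _ in range(n):
        mu = rng.uniform(0.5, 1.5)  # assumed training range for mu
        v, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
        dv = simulate_step(v, a, mu) - v
        data.append(((a, dv), mu))  # label is known because we simulated it
    return data


def fit_linear_osi(data):
    """Least-squares fit of mu ~ w_a * a + w_dv * dv via 2x2 normal equations."""
    s_aa = sum(a * a for (a, dv), _ in data)
    s_ad = sum(a * dv for (a, dv), _ in data)
    s_dd = sum(dv * dv for (a, dv), _ in data)
    b_a = sum(a * mu for (a, dv), mu in data)
    b_d = sum(dv * mu for (a, dv), mu in data)
    det = s_aa * s_dd - s_ad * s_ad
    w_a = (b_a * s_dd - b_d * s_ad) / det
    w_dv = (s_aa * b_d - s_ad * b_a) / det
    return w_a, w_dv


def predict(weights, a, dv):
    """Estimate mu from one observed transition."""
    return weights[0] * a + weights[1] * dv
```

A typical use is `weights = fit_linear_osi(collect_dataset(200, random.Random(0)))` followed by `predict(weights, a, dv)` on a new transition. The labels cost nothing to obtain because the simulator knows the μ it was run with, which is why this part of the training needs no reward signal. A real OSI would take a longer history window and a nonlinear model.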

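Experimental Results

Research Questions

RQ1: Can a single control policy trained only on simulated data generalize across a wide range of unknown dynamic models?
RQ2: Can online system identification accurately estimate dynamic model parameters (such as mass and friction) in real time from the state-action history?
RQ3: Under unseen dynamics, does the combination of the Universal Policy and Online System Identification outperform a Universal Policy that is given the true model parameters?
RQ4: Can the system generalize to dynamic model parameters outside the training distribution? If so, why does it outperform the baseline in these cases?
RQ5: To what extent can UP-OSI narrow the reality gap between simulated and real-world robot control without real-world fine-tuning?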

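Key Findings

Within the training distribution, UP-OSI performs comparably to the Universal Policy given the true model parameters (UP-true), indicating that its online model estimation is effective.
In environments whose model parameters fall outside the training range, UP-OSI significantly outperforms the UP-true baseline, indicating strong generalization and extrapolation.
By continuously updating its estimate of μ, the system remains robust and adaptive in environments with time-varying dynamics, such as a changing friction coefficient.
The OSI module successfully identifies model parameters in a four-dimensional space (for example, in the inverted pendulum system), showing that the approach is feasible for parameterizations of moderate dimension.
The decoupled design, with supervised learning for system identification and reinforcement learning for the policy, improves sample efficiency and converges faster than end-to-end training.
UP-OSI modulates its policy behavior through the estimated dynamic parameters, which suggests it can adaptively switch between or blend control strategies as the environment changes.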


This review was generated by AI and reviewed by a human editor.