QUICK REVIEW

[Paper Review] Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control

Kendall Lowrey, Aravind Rajeswaran|arXiv (Cornell University)|Nov 5, 2018

Reinforcement Learning in Robotics | References: 36 | Cited by: 66

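One-Sentence Summary

POLO combines online trajectory optimization with offline value function learning and uncertainty-driven exploration to achieve efficient, planning-based learning in high-dimensional control tasks.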

ABSTRACT

We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enable solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.

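Motivation and Objectives

Enable continual acting and learning in a complex world using an internal dynamics model.
Show how local trajectory optimization interacts with global value function learning to stabilize and accelerate learning.
Demonstrate that approximate value functions can shorten the planning horizon and improve policy quality.
Develop an exploration strategy that uses trajectory optimization for temporally coordinated exploration.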

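Proposed Method

Use model-based trajectory optimization (MPC) with a nominal dynamics model to compute locally optimal action sequences.
Apply fitted value iteration with a parametric function approximator to learn a global value function V that guides planning.
Implement uncertainty-aware exploration by maintaining multiple value function approximators and applying a softmax over their outputs to form an optimistic value estimate.
Plan exploration by optimizing trajectories under the approximate posterior over value functions, which yields temporally coordinated exploration.
Define N-step trajectory targets for value function updates to accelerate learning and stabilize training (Eq. 7).
Iterate: collect experience, update the ensemble of value functions, then run MPC with the optimistic terminal value.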
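
The optimistic ensemble estimate and the N-step target described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the "softmax" is a log-sum-exp soft maximum, and the division by the temperature is a normalization chosen here so the result stays on the same scale as the inputs. The names `optimistic_value`, `n_step_target`, and `kappa` are hypothetical, and the reward sequence is assumed to come from a planner's optimized rollout.

```python
import numpy as np

def optimistic_value(values, kappa=1.0):
    """Soft maximum (log-sum-exp) over ensemble value predictions.

    values: shape (K,) or (K, B), one row per ensemble member (assumed layout).
    kappa: temperature (hypothetical name); larger values approach the hard max.
    """
    values = np.asarray(values, dtype=float)
    scaled = kappa * values
    m = scaled.max(axis=0)  # subtract the max for numerical stability
    return (m + np.log(np.exp(scaled - m).sum(axis=0))) / kappa

def n_step_target(rewards, terminal_value, gamma=0.99):
    """Discounted sum of N rewards plus the discounted terminal value."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(discounts @ rewards + gamma ** n * terminal_value)

# Three ensemble members disagree about one state; the estimate leans optimistic.
v_hat = optimistic_value([1.0, 2.0, 3.0], kappa=100.0)

# Two planned steps with reward 1 each, then bootstrap from a terminal value of 10.
target = n_step_target([1.0, 1.0], terminal_value=10.0, gamma=0.5)
```

With a large `kappa` the estimate approaches the most optimistic ensemble member, and with a small `kappa` disagreement between members raises it further, which is what rewards visiting uncertain states.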

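Experimental Results

Research Questions

RQ1: Can trajectory optimization combined with uncertainty estimation achieve temporally coordinated exploration?
RQ2: Can a learned value function let MPC use a shorter planning horizon without sacrificing performance?
RQ3: Does trajectory optimization accelerate and stabilize value function learning in high-dimensional tasks?
RQ4: Can POLO solve complex tasks (e.g., humanoid locomotion, dexterous manipulation) with limited real-world experience?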

Key Findings

Trajectory optimization enables purposeful, temporally coordinated exploration and improves coverage of the state space.
In high-dimensional tasks, POLO outperforms pure MPC at comparable planning horizons, showing faster skill acquisition and better performance.
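MPC with a longer planning horizon is more robust to value function approximation errors than a greedy policy.
N-step trajectory optimization targets accelerate value function learning and stabilize the training targets.
A learned value function can guide MPC toward task progress even when rewards are sparse or changing.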
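
The role of a terminal value in short-horizon MPC under sparse reward can be illustrated with a toy planner. The sketch below uses plain random shooting as a stand-in for the paper's trajectory optimizer, so it is an illustration under stated assumptions rather than the POLO method. The 1-D point system, `toy_dynamics`, `sparse_reward`, `toy_value`, and all parameter values are hypothetical.

```python
import numpy as np

def plan_first_action(state, dynamics, reward, terminal_value,
                      horizon=5, num_samples=256, gamma=0.99, rng=None):
    """Random-shooting MPC: sample action sequences, score them, return the best first action."""
    rng = np.random.default_rng(0) if rng is None else rng
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon))
    returns = np.zeros(num_samples)
    states = np.full(num_samples, float(state))
    for t in range(horizon):
        returns += gamma ** t * reward(states, actions[:, t])
        states = dynamics(states, actions[:, t])
    # Bootstrap beyond the horizon with the (stand-in) value function.
    returns += gamma ** horizon * terminal_value(states)
    return actions[np.argmax(returns), 0]

# Toy 1-D task (hypothetical): move a point from x = 0 toward a goal at x = 10.
def toy_dynamics(x, a):
    return x + a

def sparse_reward(x, a):
    # Reward is nonzero only within 0.5 of the goal, far outside a 3-step horizon.
    return (np.abs(x - 10.0) < 0.5).astype(float)

def toy_value(x):
    # Stand-in for a learned value function: higher closer to the goal.
    return -np.abs(x - 10.0)

action = plan_first_action(0.0, toy_dynamics, sparse_reward, toy_value,
                           horizon=3, num_samples=1024)
```

Within three steps no sampled rollout reaches the rewarded region, so every rollout collects zero reward and only the terminal value separates the candidates. Replacing `toy_value` with a function that returns zeros would leave the planner with no signal at this horizon.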


This review was generated by AI and reviewed by a human editor.