QUICK REVIEW

[Paper Review] Learning model-based planning from scratch

Razvan Pascanu, Yujia Li | arXiv (Cornell University) | Jul 19, 2017

Artificial Intelligence in Games | References: 21 | Cited by: 78

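One-Sentence Summary

This work proposes the Imagination-based Planner (IBP), a fully learnable model-based agent that constructs, evaluates, and executes plans through imagined rollouts, demonstrated on continuous control and discrete maze tasks.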

ABSTRACT

Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the "Imagination-based Planner", the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a "plan context" which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex "imagination tree" by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.

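Motivation and Objectives

Advance model-based planning by integrating imagination into planning, so that the agent learns how to plan rather than only how to act.
Present a fully learnable architecture that learns when to imagine, how to imagine, and how to aggregate imagined outcomes into a plan.
Demonstrate IBP on challenging continuous control and discrete maze tasks, where it learns planning strategies tailored to each task.
Explore the computational cost of imagination and how the agent balances external reward against internal resource use.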

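Proposed Method

Define IBP as four components: a manager that decides whether to act or imagine, a controller that proposes actions, an imagination model that predicts outcomes, and a memory that aggregates internal and external data.
Frame planning as an iterative process in which each step either executes an action or imagines an outcome, building a plan context from both imagined and real experience.
Implement three imagination strategies (1-step, n-step, and imagination tree), which determine the state from which to imagine and how imagined actions are chained.
Train the model end to end with two losses: an external task loss (fuel cost plus final distance to the target) and an internal resource cost (the cost of imagining), combining gradient-based optimization with REINFORCE for the discrete routing decisions.
Use an interaction network as the world model, both for imagined dynamics and for predicting real state transitions, and optimize continuous actions with SVG-based gradients (stochastic value gradients).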
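
The control flow described in the method points above can be illustrated with a small sketch. This is not the authors' implementation: every name here (PlanContext, toy_model, controller, manager, choose_base_state, run_iteration) is hypothetical, the environment is a toy 1-D point, and the manager, controller, and tree navigation are hand-written placeholders standing in for the learned networks. It only shows how the act-or-imagine loop, the three imagination strategies, and the two costs fit together.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PlanContext:
    """Memory stand-in: a plain list of transitions (the paper learns this aggregation)."""
    history: list = field(default_factory=list)

    def update(self, kind, state, action, next_state, reward):
        self.history.append((kind, state, action, next_state, reward))


def toy_model(state, action):
    """Toy dynamics used as both world and imagination model (assumption).
    The state is a 1-D position; reward is the negative distance to the origin."""
    next_state = state + action
    return next_state, -abs(next_state)


def controller(state, context):
    """Placeholder controller: step halfway toward the origin.
    A learned controller would also condition on the plan context."""
    return -0.5 * state


def manager(n_imagined, max_imagine):
    """Placeholder manager: imagine until a fixed budget is spent, then act.
    In IBP this decision is made by a learned policy."""
    return "imagine" if n_imagined < max_imagine else "act"


def choose_base_state(strategy, real_state, imagined_states, rng):
    """Pick the state to imagine from, per strategy."""
    if strategy == "1-step" or not imagined_states:
        return real_state                      # always restart from the real state
    if strategy == "n-step":
        return imagined_states[-1]             # chain from the latest imagined state
    if strategy == "tree":
        # Random choice is a placeholder for the learned navigation policy.
        return rng.choice([real_state] + imagined_states)
    raise ValueError(f"unknown strategy: {strategy}")


def run_iteration(real_state, strategy="n-step", max_imagine=3,
                  imagine_cost=0.1, seed=0):
    """One real step: imagine zero or more times, then act in the world."""
    rng = random.Random(seed)
    context = PlanContext()
    imagined_states = []
    resource_cost = 0.0

    while manager(len(imagined_states), max_imagine) == "imagine":
        base = choose_base_state(strategy, real_state, imagined_states, rng)
        action = controller(base, context)
        next_state, reward = toy_model(base, action)
        context.update("imagined", base, action, next_state, reward)
        imagined_states.append(next_state)
        resource_cost += imagine_cost          # internal cost of each imagination step

    action = controller(real_state, context)
    next_state, reward = toy_model(real_state, action)
    context.update("real", real_state, action, next_state, reward)
    task_loss = -reward                        # external loss for this toy task
    return next_state, task_loss, resource_cost, context


if __name__ == "__main__":
    for strategy in ("1-step", "n-step", "tree"):
        _, task_loss, resource_cost, ctx = run_iteration(4.0, strategy)
        print(strategy, task_loss, round(resource_cost, 2), len(ctx.history))
```

In this toy setup the placeholder controller ignores the plan context, so imagining does not change the real action; in the actual agent the context conditions both the imagined and the real actions, which is where the benefit of planning would come from.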

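Experimental Results

Research Questions

RQ1: Can a fully learnable model-based planner use imagined rollouts to construct, evaluate, and execute plans?
RQ2: When planning, how should the agent balance external task performance against internal computational cost?
RQ3: Which planning strategies (1-step, n-step, or tree-like imagination) are most effective in continuous and discrete tasks?
RQ4: Can learned imagination strategies generalize across tasks and handle state ambiguity in the discrete maze?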

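Key Findings

IBP learns to use model-based imagination to improve performance on a challenging continuous control task.
Imagination helps the agent test alternatives, chain actions together, and build complex imagination trees for planning.
Increasing the number of allowed imagination steps reduces task loss, which shows the value of looking ahead when planning.
In the discrete maze, the imagination-tree strategy outperforms the 1-step and n-step strategies and approaches the optimal reward in multi-goal scenarios.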


This review was generated by AI and reviewed by human editors.