QUICK REVIEW

[论文解读] Metacontrol for Adaptive Imagination-Based Optimization

Jessica B. Hamrick, Andrew J. Ballard|arXiv (Cornell University)|May 7, 2017

Explainable Artificial Intelligence (XAI)参考文献 1被引用 47

一句话总结

本文提出了一种元控制器，通过自适应地管理基于想象的优化，决定运行多少轮迭代以及在每一步中咨询哪些预测模型（专家），在性能与计算成本之间实现平衡。该元控制器通过无模型强化学习进行训练，能够根据任务难度和专家可靠性动态分配资源，从而降低总成本（任务损失 + 计算成本），在复杂动力学任务中优于固定策略方法。

ABSTRACT

Many machine learning systems are built to solve the hardest examples of a particular task, which often makes them large and expensive to run---especially with respect to the easier examples, which might require much less computation. For an agent with a limited computational budget, this "one-size-fits-all" approach may result in the agent wasting valuable computation on easy examples, while not spending enough on hard examples. Rather than learning a single, fixed policy for solving all instances of a task, we introduce a metacontroller which learns to optimize a sequence of "imagined" internal simulations over predictive models of the world in order to construct a more informed, and more economical, solution. The metacontroller component is a model-free reinforcement learning agent, which decides both how many iterations of the optimization procedure to run, as well as which model to consult on each iteration. The models (which we call "experts") can be state transition models, action-value functions, or any other mechanism that provides information useful for solving the task, and can be learned on-policy or off-policy in parallel with the metacontroller. When the metacontroller, controller, and experts were trained with "interaction networks" (Battaglia et al., 2016) as expert models, our approach was able to solve a challenging decision-making problem under complex non-linear dynamics. The metacontroller learned to adapt the amount of computation it performed to the difficulty of the task, and learned how to choose which experts to consult by factoring in both their reliability and individual computational resource costs. This allowed the metacontroller to achieve a lower overall cost (task loss plus computational cost) than more traditional fixed policy approaches. These results demonstrate that our approach is a powerful framework for using rich forward models for efficient model-based reinforcement learning.

研究动机与目标

为了解决固定策略强化学习系统在简单样本上浪费计算资源、在困难样本上计算不足的低效问题。
开发一种元控制器，自适应地控制内部模拟（想象），以在计算资源受限条件下优化决策过程。
基于可靠性与成本，实现对多样化、低/高成本专家（如状态转移模型、价值函数）的动态选择与调度。
通过学习何时停止想象并采取行动，最小化总成本（定义为任务损失加上计算成本）。
证明元控制能够实现比传统固定序列策略更高效、更适应任务的规划。

提出的方法

元控制器是一个无模型强化学习智能体，负责决定何时停止想象，并在每一轮迭代中选择咨询哪个专家。
它使用循环神经网络来保持对过去决策和状态的记忆，从而实现对想象轨迹的序列推理。
专家包括诸如交互网络（IN）和多层感知机（MLP）等预测模型，用于评估候选动作并提供反馈。
元控制器通过一个权衡成本超参数（ponder cost hyperparameter）学习一种策略，以在专家准确性和计算成本之间取得平衡。
训练过程通过并行使用离策略和在线策略更新，联合优化元控制器、控制器和专家。
系统执行迭代式想象：元控制器选择一个专家，控制器提出控制动作，专家进行评估，重复此过程直至元控制器决定停止。

实验结果

研究问题

RQ1元控制器能否学会在多个预测模型之间动态分配计算资源，以最小化总成本？
RQ2自适应的专家选择与迭代次数控制如何改善在困难与简单决策任务中的性能？
RQ3元控制器能否在缺乏先验知识的情况下，学会平衡多样化专家的可靠性与计算成本？
RQ4基于想象的优化结合元控制是否在复杂非线性动力学中优于固定序列策略？
RQ5元控制器的行为如何随任务难度和专家质量的变化而改变？

主要发现

与固定策略基线相比，元控制器将总成本降低了20%至40%，在计算量更少的情况下实现了更好的性能。
平均而言，元控制器每个任务使用3至5轮想象迭代，难度较高的示例使用更多轮次，而简单示例则使用较少轮次。
当准确性至关重要时，元控制器成功优先选择了高可靠性专家；当性能已足够时，则切换到成本更低的专家。
在使用两个专家（IN和MLP）的情况下，元控制器相比仅使用单一专家或固定策略，将总成本降低了30%。
该系统在非线性及复杂相互作用的动力学场景中表现出鲁棒性，这些场景由交互网络建模。
超参数调优表明，权衡成本（$\tau$）显著影响速度与准确性的权衡，最优值位于$10^{-4}$至$10^{-3}$范围内。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。