QUICK REVIEW

[Paper Review] Learning to select computations

Falk Lieder, Frederick Callaway|arXiv (Cornell University)|Jan 1, 2017

Reinforcement Learning in Robotics | References: 20 | Cited by: 2

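ONE-SENTENCE SUMMARY

The paper proposes a sample-efficient reinforcement learning algorithm that learns to select computations by approximating rational metareasoning; its core insight is that the value of computation lies between the myopic value of computation and the value of perfect information. Across three metareasoning tasks (termination, action selection, and planning) the method achieved near-optimal performance, significantly outperforming the meta-greedy heuristic in all three, matching the blinkered policy in decision-making, and outperforming a generalization of the blinkered policy in planning.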
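
The core insight can be written as a pair of bounds. The notation below is an assumption introduced for illustration and is not taken from the review: $b$ is the agent's current belief, $c$ a candidate computation, $U(b) = \max_a \mathbb{E}[u(a) \mid b]$ the expected utility of acting immediately on $b$, and $\theta$ the unknown quantities the belief is about.

```latex
\begin{align}
  \mathrm{VOC}_1(c, b) &= \mathbb{E}_{b' \mid b, c}\!\left[ U(b') \right] - U(b) - \mathrm{cost}(c)
    && \text{(myopic: perform $c$, then act)} \\
  \mathrm{VPI}(b) &= \mathbb{E}_{\theta \mid b}\!\left[ \max_a u(a; \theta) \right] - U(b)
    && \text{(act with all uncertainty resolved)} \\
  \mathrm{VOC}_1(c, b) &\le \mathrm{VOC}(c, b) \le \mathrm{VPI}(b)
\end{align}
```

The lower bound holds because stopping right after $c$ is only one of the continuations available to the agent; the upper bound holds because no sequence of costly computations can be worth more than free, perfect information.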

ABSTRACT

Efficient use of limited computational resources is essential to intelligence. Selecting computations optimally according to rational metareasoning would achieve this, but rational metareasoning is computationally intractable. Inspired by psychology and neuroscience, we propose the first learning algorithm for approximating the optimal selection of computations. We derive a general, sample-efficient reinforcement learning algorithm for learning to select computations from the insight that the value of computation lies between the myopic value of computation and the value of perfect information. We evaluate the performance of our method against two state-of-the-art methods for approximate metareasoning--the meta-greedy heuristic and the blinkered policy--on three increasingly difficult metareasoning problems: metareasoning about when to terminate computation, metareasoning about how to choose between multiple actions, and metareasoning about planning. Across all three domains, our method achieved near-optimal performance and significantly outperformed the meta-greedy heuristic. The blinkered policy performed on par with our method in metareasoning about decision-making, but it is not directly applicable to metareasoning about planning where our method outperformed both the meta-greedy heuristic and a generalization of the blinkered policy. Our results are a step towards building self-improving AI systems that can learn to make optimal use of their limited computational resources to efficiently solve complex problems in real-time.

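MOTIVATION AND OBJECTIVES

Allocate limited computational resources efficiently in intelligent systems.
Overcome the computational intractability of rational metareasoning by learning an approximation to it.
Develop a general method that applies across metareasoning problems such as terminating computation, action selection, and planning.
Outperform existing approximate metareasoning methods, such as the meta-greedy heuristic and the blinkered policy, in both performance and applicability.
Move toward self-improving AI systems that make optimal use of limited computational resources in real time.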

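PROPOSED METHOD

Uses reinforcement learning to learn a computation-selection policy from experience, with a value function that balances the myopic value of computation against the value of perfect information.
Bounds the value of computation between the myopic value of computation (the gain from a single computation) and the value of perfect information, which supports stable learning.
Trains a sample-efficient reinforcement learning algorithm on simulated metareasoning tasks to approximate the optimal policy for selecting computations.
Applies the learned policy in three domains: when to terminate computation, how to choose between multiple actions, and planning.
Uses function approximation to generalize across states and actions in complex decision spaces.
Trains end to end from experience gathered by interacting with the environment rather than relying on hand-designed heuristics.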
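
The two bounds named above can be made concrete on a toy problem. The sketch below is illustrative only and is not the paper's implementation. It assumes a choice between a safe option of known value and one uncertain option with a Gaussian belief, where each computation is one noisy sample of the uncertain option. All function names, parameters, and the fixed `weight` are hypothetical; in the method described above the balance between the two bounds would be learned from experience rather than set by hand.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_gain(mu, sigma, safe_value):
    """E[max(x, safe_value)] - max(mu, safe_value) for x ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 0.0  # nothing left to learn, so no gain
    z = -abs(mu - safe_value) / sigma
    return sigma * (z * normal_cdf(z) + normal_pdf(z))


def myopic_voc(mu, sigma, safe_value, obs_noise, cost):
    """Value of taking exactly one noisy sample and then acting (assumes obs_noise > 0)."""
    # Standard deviation of the shift in the posterior mean after one sample.
    sigma_tilde = sigma ** 2 / math.sqrt(sigma ** 2 + obs_noise ** 2)
    return expected_gain(mu, sigma_tilde, safe_value) - cost


def vpi(mu, sigma, safe_value):
    """Value of learning the uncertain option's value exactly, at no cost."""
    return expected_gain(mu, sigma, safe_value)


def blended_voc(mu, sigma, safe_value, obs_noise, cost, weight):
    """Hypothetical estimate: a convex combination of the two bounds, weight in [0, 1]."""
    lower = myopic_voc(mu, sigma, safe_value, obs_noise, cost)
    upper = vpi(mu, sigma, safe_value)
    return weight * lower + (1.0 - weight) * upper


def should_compute(mu, sigma, safe_value, obs_noise, cost, weight):
    """Termination rule: keep computing only while the estimated value is positive."""
    return blended_voc(mu, sigma, safe_value, obs_noise, cost, weight) > 0.0
```

With `weight = 1.0` the rule reduces to stopping as soon as a single further computation no longer pays for itself, which is the meta-greedy behavior; smaller weights credit a computation with some of the value that later computations could add.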

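EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a policy learned by reinforcement learning approximate the optimal selection of computations better than existing heuristics?
RQ2: How does the proposed method perform across different metareasoning problems, including termination, action selection, and planning?
RQ3: Does the method extend to domains where the blinkered policy is not directly applicable, such as planning?
RQ4: How close to optimal is the algorithm in terms of computational efficiency and solution quality?
RQ5: Does a value-function formulation that connects the myopic value of computation with the value of perfect information lead to stable and effective learning?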

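Key Findings

The proposed method achieved near-optimal performance on all three metareasoning tasks: terminating computation, action selection, and planning.
It significantly outperformed the meta-greedy heuristic in all three domains.
In metareasoning about decision-making, the blinkered policy performed on par with the proposed method; in planning, where the blinkered policy is not directly applicable, the proposed method still applied.
In planning, the method outperformed both the meta-greedy heuristic and a generalization of the blinkered policy, which points to its broader applicability.
The sample-efficient reinforcement learning framework learned stably from limited experience, a step toward using such systems in real time.
The value-function formulation balanced the short-term benefit of a single computation against the longer-term benefit of further information, which yielded robust policy learning.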

This review was generated by AI and reviewed by human editors.