QUICK REVIEW

[Paper Review] Uncertainty-sensitive Learning and Planning with Ensembles

Piotr Miłoś, Łukasz Kuciński|arXiv (Cornell University)|Dec 19, 2019

AI-based Problem Solving and Planning | Cited by 2

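One-sentence summary

The paper proposes a reinforcement learning framework that combines value function ensembles with Monte Carlo Tree Search (MCTS) planning to improve exploration in hard, sparse-reward environments. By modelling uncertainty through ensemble variance and applying risk-sensitive functionals, the method improves both planning efficiency and value function learning, and achieves faster convergence and better performance on the Deep-sea, Toy Montezuma's Revenge, and Sokoban benchmarks.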

ABSTRACT

We propose a reinforcement learning framework for discrete environments in which an agent makes both strategic and tactical decisions. The former manifests itself through the use of a value function, while the latter is powered by a tree search planner. These tools complement each other. The planning module performs a local what-if analysis, which allows the agent to avoid tactical pitfalls and boosts backups of the value function. The value function, being global in nature, compensates for the inherent locality of the planner. In order to further solidify this synergy, we introduce an exploration mechanism with two distinctive components: uncertainty modelling and risk measurement. To model the uncertainty we use value function ensembles, and to reflect risk we propose several functionals that summarize the uncertainty implied by the ensemble. We show that our method performs well on hard exploration environments: Deep-sea, toy Montezuma's Revenge, and Sokoban. In all the cases, we obtain a speed-up in learning and a boost in performance.
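To make the idea of a risk functional over an ensemble concrete, here is a minimal sketch. It assumes one illustrative functional, the ensemble mean plus `kappa` times the ensemble standard deviation, and it is not necessarily one of the functionals used in the paper. The names `risk_score`, `select_action`, and `kappa` are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Sequence


def risk_score(member_values: Sequence[float], kappa: float) -> float:
    """Summarize one action's ensemble estimates as mean + kappa * std.

    Assumption: this is an illustrative functional, not the paper's exact one.
    kappa > 0 rewards disagreement between members (optimistic exploration).
    """
    return fmean(member_values) + kappa * pstdev(member_values)


def select_action(ensemble_values: Sequence[Sequence[float]], kappa: float) -> int:
    """Pick the action with the highest risk-adjusted score.

    ensemble_values[a] holds every ensemble member's value estimate for action a.
    """
    scores = [risk_score(values, kappa) for values in ensemble_values]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical estimates from a three-member ensemble for two actions.
estimates = [
    [0.50, 0.50, 0.50],  # action 0: members agree, mean 0.5
    [0.10, 0.90, 0.20],  # action 1: members disagree, mean 0.4
]

greedy_choice = select_action(estimates, kappa=0.0)      # ignores uncertainty
exploring_choice = select_action(estimates, kappa=1.0)   # favors disagreement
```

With `kappa = 0` the rule reduces to picking the highest mean. A positive `kappa` shifts preference toward actions on which the ensemble members disagree, which is the sense in which uncertainty drives exploration here.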

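Motivation and objectives

Address the challenges of sample efficiency and exploration in sparse-reward, high-complexity environments such as Sokoban and Deep-sea.
Make planning more robust by integrating uncertainty-aware value function ensembles into tree search.
Improve value function learning through ensemble-based uncertainty modelling and hindsight relabeling.
Build a synergistic framework in which planning guides exploration while the value function compensates for the planner's limitations.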

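Proposed method

Model epistemic uncertainty with an ensemble of value networks, aggregating their predictions through learnable head networks.
Apply risk measures (functionals of the ensemble variance) to guide exploration in MCTS, favoring states with high uncertainty.
Combine MCTS with value-function-based simulation, using the planner's search history to generate training targets for the value function.
Use prioritized experience replay and hindsight relabeling to improve sample efficiency during value function learning.
Train the value function on fixed trajectories, using relabeling to generate additional positive examples from failed trajectories.
Adopt a hybrid model-free and model-based approach in which the planner runs on a learned environment model.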
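The relabeling step mentioned above can be sketched generically. The example assumes the simplest hindsight scheme: the last state of a failed trajectory is treated as if it had been the goal, and each earlier state receives the discounted value of a sparse reward of 1 on reaching it. The function name, the `(state, goal, target)` layout, and the choice of the final state as the goal are assumptions for illustration, not the paper's implementation.

```python
from typing import Hashable, List, Sequence, Tuple


def relabel_with_final_state(
    states: Sequence[Hashable], gamma: float = 0.99
) -> List[Tuple[Hashable, Hashable, float]]:
    """Turn a failed trajectory into positive value-learning examples.

    Assumption: the last visited state is substituted as the goal, and the
    target is the discounted return of a reward of 1 received on reaching it.
    Returns (state, goal, target) triples, one per visited state.
    """
    if not states:
        return []
    goal = states[-1]
    last = len(states) - 1
    # A state visited t steps into the trajectory is (last - t) steps from the goal.
    return [(state, goal, gamma ** (last - t)) for t, state in enumerate(states)]


# Hypothetical three-step trajectory that never reached the original goal.
examples = relabel_with_final_state(["s0", "s1", "s2"], gamma=0.5)
```

Each triple can then be stored in a replay buffer alongside ordinary experience, so that even unsuccessful episodes contribute non-zero value targets.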

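Experimental results

Research questions

RQ1: Does ensemble-based uncertainty modelling improve exploration in sparse-reward environments?
RQ2: How does risk-sensitive planning with value function ensembles affect learning speed and performance?
RQ3: To what extent does integrating planner-generated trajectories into value function learning improve learning?
RQ4: In hard exploration tasks, does combining model-free value learning with model-based planning outperform either approach alone?
RQ5: Does uncertainty quantification through ensembles enable more effective exploration in combinatorially complex environments such as Sokoban?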

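Key findings

The method significantly speeds up learning and improves performance on the Deep-sea, Toy Montezuma's Revenge, and Sokoban environments.
In the Sokoban transfer learning task, increasing the ensemble from 2 to 3 value networks improved performance by roughly 10–12%.
Ensemble-based uncertainty combined with risk measures led to more effective exploration and reduced reliance on random search.
Value function ensembles outperform a single network, and performance improves as the ensemble grows.
Integrating the planner's search history into value function training improved learning efficiency and sample efficiency.
The larger network architecture (a 5-layer CNN) generalized better than the smaller one (4 layers), suggesting that model capacity matters for generalization in complex tasks.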

This review was generated by AI and reviewed by human editors.