QUICK REVIEW

[Paper Review] Combining Q-Learning and Search with Amortized Value Estimates

Jessica B. Hamrick, Victor Bapst|arXiv (Cornell University)|Apr 30, 2020

Reinforcement Learning in Robotics|References 46|Cited by 17

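ONE-SENTENCE SUMMARY

SAVE combines Q-learning with Monte-Carlo Tree Search (MCTS) by using a learned prior over state-action values to guide the search, whose improved Q-estimates are then used to update the prior; this amortizes the MCTS computation, speeding up learning and yielding strong performance with very small search budgets.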

ABSTRACT

We introduce Search with Amortized Value Estimates (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and Atari. SAVE consistently achieves higher rewards with fewer training steps and, in contrast to typical model-based search approaches, yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.
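
The statement that the new Q-estimates are "used in combination with real experience to update the prior" can be read as a two-term training objective. The following is a sketch of one such objective, not a formula quoted from the abstract: the weights \beta_Q and \beta_A, the softmax form of the amortization term, and the symbols Q_\theta (the prior) and Q_{\mathrm{MCTS}} (the search output) are notation assumed here for illustration. \mathcal{L}_Q stands for an ordinary Q-learning loss on real transitions.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \beta_Q \, \mathcal{L}_Q(\theta) + \beta_A \, \mathcal{L}_A(\theta), \\
\mathcal{L}_A(\theta) &= -\sum_{a} p_{\mathrm{MCTS}}(a \mid s) \, \log p_{\theta}(a \mid s), \\
p_{\mathrm{MCTS}}(\cdot \mid s) &= \mathrm{softmax}\big(Q_{\mathrm{MCTS}}(s, \cdot)\big), \qquad
p_{\theta}(\cdot \mid s) = \mathrm{softmax}\big(Q_{\theta}(s, \cdot)\big).
\end{aligned}
```

Under this reading, the first term carries the information from real experience and the second term pulls the prior toward the values computed during search, which is what makes the search computation reusable in later steps.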

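RESEARCH MOTIVATION AND GOALS

Reduce the computational cost of model-based planning in reinforcement learning while retaining high sample efficiency.
Improve sample efficiency and learning speed in deep reinforcement learning by combining model-free Q-learning with model-based search.
Achieve strong performance with very small search budgets, overcoming a key limitation of typical model-based methods.
Establish a cooperative learning loop between model-free updates and model-based search through amortized value estimates.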

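PROPOSED METHOD

A learned prior over state-action values guides Monte-Carlo Tree Search (MCTS), making the search more efficient.
MCTS computes improved state-action value estimates from the prior and the environment dynamics.
The improved Q-estimates from MCTS are combined with real experience to update the prior network through Q-learning.
This forms a feedback loop: search improves learning, and learning in turn improves the guidance given to search.
The method is modular and can be added to any Q-learning agent that has access to a model.
Search-derived value estimates are reused across multiple learning updates, which amortizes the value computation and lowers the per-step computational cost.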
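
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it swaps the neural-network prior for a lookup table, replaces full MCTS with a root-only UCT search that bootstraps from the prior, and runs on a hypothetical four-state chain environment. All names (`model`, `search`, `train`, `beta_q`, `beta_a`) and all hyperparameter values are assumptions made for the example.

```python
import math

GAMMA = 0.9
ACTIONS = (0, 1)  # hypothetical actions: 0 = left, 1 = right
GOAL = 3          # hypothetical chain with states 0..3


def model(state, action):
    """Deterministic toy model: reward 1.0 for stepping into GOAL."""
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == GOAL and state != GOAL else 0.0
    return nxt, reward, nxt == GOAL


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def search(q_prior, state, budget, c_uct=1.0):
    """Root-only UCT search (a simplification of MCTS).

    Root values start at the prior with a visit count of 1, so the
    prior acts like one earlier sample that the search then refines.
    """
    q = {a: q_prior[(state, a)] for a in ACTIONS}
    n = {a: 1 for a in ACTIONS}
    for _ in range(budget):
        total = sum(n.values())
        a = max(ACTIONS,
                key=lambda b: q[b] + c_uct * math.sqrt(math.log(total) / n[b]))
        nxt, reward, done = model(state, a)
        # Bootstrap the simulated return from the prior at the next state.
        boot = 0.0 if done else max(q_prior[(nxt, b)] for b in ACTIONS)
        n[a] += 1
        q[a] += (reward + GAMMA * boot - q[a]) / n[a]
    return [q[a] for a in ACTIONS]


def train(episodes=300, budget=4, lr=0.1, beta_q=1.0, beta_a=1.0):
    """Alternate search and learning on a tabular prior."""
    q_prior = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for ep in range(episodes):
        state = ep % GOAL  # cycle start states instead of random exploration
        for _ in range(20):
            q_search = search(q_prior, state, budget)
            action = max(ACTIONS, key=lambda a: q_search[a])
            nxt, reward, done = model(state, action)
            # Q-learning term: TD error on the real transition.
            boot = 0.0 if done else max(q_prior[(nxt, b)] for b in ACTIONS)
            td_error = reward + GAMMA * boot - q_prior[(state, action)]
            # Amortization term: gradient of the cross-entropy between
            # softmax(search values) and softmax(prior values).
            p_search = softmax(q_search)
            p_prior = softmax([q_prior[(state, a)] for a in ACTIONS])
            q_prior[(state, action)] += lr * beta_q * td_error
            for a in ACTIONS:
                q_prior[(state, a)] -= lr * beta_a * (p_prior[a] - p_search[a])
            state = nxt
            if done:
                break
    return q_prior
```

Calling `train()` returns the learned table, and `search(table, state, budget)` can then be run with a small `budget` to see how a better prior changes the search output. Setting `beta_a=0.0` turns the update into plain tabular Q-learning acting on search-selected actions, which makes the contribution of the amortization term easy to isolate in this toy setting.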

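EXPERIMENTAL RESULTS

Research Questions

RQ1: Can combining model-free Q-learning with model-based search improve sample efficiency in reinforcement learning?
RQ2: How can MCTS computation be amortized to reduce planning cost without sacrificing performance?
RQ3: Can guiding search with a learned prior deliver strong performance with very small search budgets?
RQ4: Does the cooperative loop between search and learning lead to faster convergence and higher final returns?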

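Key Findings

On both the physical reasoning tasks and Atari, SAVE outperforms baseline Q-learning agents in cumulative reward.
The method converges markedly faster, reaching peak performance in fewer training steps.
Even with very small search budgets, SAVE maintains strong performance and outperforms standard model-based methods under the same constraints.
Combining search-derived value estimates with real experience yields more accurate and more stable Q-value estimates.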

This review was generated by AI and reviewed by a human editor.