QUICK REVIEW

[Paper Review] ReinforceWalk: Learning to Walk in Graph with Monte Carlo Tree Search

Yelong Shen, Jianshu Chen|arXiv (Cornell University)|Feb 12, 2018

Advanced Graph Neural Networks|References: 36|Cited by: 19

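One-Sentence Summary

ReinforceWalk is a reinforcement learning agent for graph walking that combines a deep recurrent neural network (RNN) with Monte Carlo Tree Search (MCTS) to address partial observability and sparse reward. It uses MCTS to generate high-reward trajectories and updates the RNN policy off-policy with Q-learning, outperforming policy-gradient baselines while using fewer rollouts.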

ABSTRACT

Learning to walk over a graph towards a target node for a given input query and a source node is an important problem in applications such as knowledge graph reasoning. It can be formulated as a reinforcement learning (RL) problem that has a known state transition model, but with partial observability and sparse reward. To overcome these challenges, we develop a graph walking agent called ReinforceWalk, which consists of a deep recurrent neural network (RNN) and a Monte Carlo Tree Search (MCTS). To address partial observability, the RNN encodes the history of observations and maps it into the Q-value, the policy and the state value. In order to effectively train the agent from sparse reward, we combine MCTS with the RNN policy to generate trajectories with more positive rewards. From these trajectories, we update the network in an off-policy manner using Q-learning and improve the RNN policy. Our proposed RL algorithm repeatedly applies this policy improvement step to learn the entire model. At the testing stage, the MCTS is also combined with the RNN to predict the target node with higher accuracy. Experimental results on several graph-walking benchmarks show that we are able to learn better policies from fewer rollouts compared to other baseline methods, which are mainly based on the policy gradient method.
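
To make the formulation in the abstract concrete, the following is a minimal Python sketch of graph walking as an episodic RL problem with a known, deterministic transition model and a sparse terminal reward. The toy graph, the class name GraphWalkEnv, the STOP action, and the step budget are all assumptions made for this illustration and are not taken from the paper. Partial observability appears in that the agent only ever sees its current node and outgoing edges, never the target.

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph: node -> list of (relation, next_node) edges.
Graph = Dict[str, List[Tuple[str, str]]]

TOY_GRAPH: Graph = {
    "A": [("r1", "B"), ("r2", "C")],
    "B": [("r1", "D")],
    "C": [("r2", "D"), ("r1", "A")],
    "D": [],
}


class GraphWalkEnv:
    """Deterministic graph walk with a reward only at the end of the episode."""

    def __init__(self, graph: Graph, source: str, target: str, max_steps: int = 3):
        self.graph = graph
        self.target = target        # hidden from the agent
        self.max_steps = max_steps  # assumed step budget
        self.node = source
        self.steps = 0

    def actions(self) -> List[Tuple[str, str]]:
        # Known transition model: every outgoing edge, plus an assumed STOP action.
        return list(self.graph[self.node]) + [("STOP", self.node)]

    def step(self, action: Tuple[str, str]) -> Tuple[str, float, bool]:
        relation, next_node = action
        self.steps += 1
        self.node = next_node
        done = relation == "STOP" or self.steps >= self.max_steps
        # Sparse reward: 1.0 only if the episode ends on the target node.
        reward = 1.0 if done and self.node == self.target else 0.0
        return self.node, reward, done
```

With source "A" and target "D", walking A to B to D and then choosing STOP yields rewards 0.0, 0.0, and 1.0, so the agent receives no learning signal until the final step.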

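Motivation and Objectives

Address the challenge of learning effective graph-walking policies for knowledge graph reasoning under partial observability and sparse reward.
Improve the sample efficiency of reinforcement learning for graph walking by combining MCTS with a deep policy network.
Enable off-policy training on MCTS-generated trajectories, reducing reliance on expensive on-policy rollouts.
Improve target-node prediction accuracy at inference time by combining MCTS with the trained RNN policy.
Show that the proposed method outperforms existing policy-gradient baselines while using fewer rollouts.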

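Proposed Method

The agent uses a deep recurrent neural network (RNN) to encode the history of observations and map it to the Q-value, the policy, and the state value.
During training, Monte Carlo Tree Search (MCTS) is used to generate high-reward trajectories, which improves sample efficiency.
The RNN policy is updated off-policy with Q-learning, so the model can learn from MCTS-generated trajectories rather than relying on on-policy rollouts.
Training iteratively refines the RNN policy by repeating a policy improvement step on MCTS-generated trajectories.
At test time, MCTS is combined with the trained RNN to predict the target node more accurately.
Partial observability is addressed explicitly: the RNN hidden state maintains a history-aware state representation.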
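
The two learning ingredients named above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: a tabular Q-function stands in for the RNN, and the selection rule is a generic PUCT-style score (value estimate plus a prior-weighted exploration bonus) because this summary does not give the paper's exact MCTS formula. All function names and hyperparameters (c, alpha, gamma) are hypothetical.

```python
import math
from collections import defaultdict


def puct_score(q, prior, parent_visits, child_visits, c=1.0):
    """PUCT-style score (assumed form): value plus prior-weighted exploration bonus."""
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_action(actions, q_values, priors, visits, c=1.0):
    """Pick the action with the highest PUCT-style score at one search-tree node."""
    total = sum(visits[a] for a in actions)
    return max(
        actions,
        key=lambda a: puct_score(q_values[a], priors[a], total + 1, visits[a], c),
    )


def q_learning_update(q, trajectory, alpha=0.5, gamma=1.0):
    """Off-policy Q-learning over a stored trajectory.

    Each item is (state, action, reward, next_state, next_actions). The target
    uses the max over next actions, so it does not matter which behavior
    policy (for example, a tree search) produced the trajectory.
    """
    for state, action, reward, next_state, next_actions in trajectory:
        # Terminal states have no next actions, so their bootstrap value is 0.
        best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
        target = reward + gamma * best_next
        q[(state, action)] += alpha * (target - q[(state, action)])
    return q


# A search-generated trajectory that ends with a sparse reward of 1.0.
trajectory = [
    ("A", "toB", 0.0, "B", ["toD"]),
    ("B", "toD", 1.0, "D", []),
]
q_table = defaultdict(float)
for _ in range(2):  # repeated passes play the role of repeated improvement steps
    q_learning_update(q_table, trajectory)
```

After the first pass only the final step has a nonzero Q-value; the second pass propagates that value one step back toward the source, which is how a sparse terminal reward reaches earlier decisions. In select_action, an action with a high prior is preferred at first, but its bonus shrinks as its visit count grows, so less-visited actions eventually get explored.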

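Experimental Results

Research Questions

RQ1: Does combining MCTS with an RNN-based policy improve sample efficiency on graph-walking tasks with sparse reward?
RQ2: How does off-policy training on MCTS-generated trajectories compare with on-policy policy-gradient methods in performance and sample efficiency?
RQ3: How much does integrating MCTS at inference time improve target-node prediction accuracy?
RQ4: Can the proposed method learn effective policies on graph-walking benchmarks with fewer rollouts than existing reinforcement learning baselines?
RQ5: How well does the RNN handle partial observability in graph walking by encoding the history of observations?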

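Key Findings

ReinforceWalk learns better policies than baselines based on policy-gradient methods, even with fewer rollouts.
Using MCTS to generate high-reward trajectories during training improves performance on graph-walking benchmarks.
Off-policy Q-learning on MCTS-generated trajectories yields more efficient policy updates than on-policy methods.
Integrating MCTS at inference time improves target-node prediction accuracy.
By encoding the history of observations, the RNN mitigates the effects of partial observability in graph walking.
The method is sample-efficient, reaching strong performance on several benchmarks with relatively few rollouts.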


This review was generated by AI and reviewed by a human editor.