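[Paper Walkthrough] M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search
tldr: M-Walk combines a recurrent neural network with Monte Carlo Tree Search to learn a graph-walking policy for knowledge base completion, addressing sparse rewards through off-policy Q-learning and shared parameters.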
Learning to walk over a graph towards a target node for a given query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem with a known state transition model. To overcome the challenge of sparse rewards, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (i.e., history of the walked path) and maps it separately to a policy and Q-values. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories yielding more positive rewards. From these trajectories, the network is improved in an off-policy manner using Q-learning, which modifies the RNN policy via parameter sharing. Our proposed RL algorithm repeatedly applies this policy-improvement step to learn the model. At test time, MCTS is combined with the neural policy to predict the target node. Experimental results on several graph-walking benchmarks show that M-Walk is able to learn better policies than other RL-based methods, which are mainly based on policy gradients. M-Walk also outperforms traditional KBC baselines.
Research Motivation and Goals
- Motivate learning to walk over graphs to identify target nodes given a source and a query, with applications to knowledge base completion (KBC).
- Address sparse rewards and history-dependent states by using an RNN encoder combined with Monte Carlo Tree Search (MCTS).
- Learn a policy and Q-function with shared parameters to enable off-policy policy improvement via Q-learning.
- Leverage known deterministic graph transitions to integrate model-based search (MCTS) with neural learning for improved trajectory generation.
- Evaluate M-Walk against RL baselines and traditional KBC methods on synthetic and real-world benchmarks.
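The setting described above (a known, deterministic transition model with a reward that arrives only at the end of the walk) can be illustrated with a small sketch. The toy graph, the node and relation names, and the helper functions `actions`, `step`, and `walk` are all hypothetical and chosen for illustration; they are not taken from the paper or its code.

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph: node -> list of (relation, next_node) edges.
Graph = Dict[str, List[Tuple[str, str]]]

TOY_GRAPH: Graph = {
    "alice": [("works_at", "acme"), ("friend_of", "bob")],
    "bob": [("works_at", "globex")],
    "acme": [("located_in", "paris")],
    "globex": [("located_in", "berlin")],
    "paris": [],
    "berlin": [],
}


def actions(graph: Graph, node: str) -> List[Tuple[str, str]]:
    """The actions available at a node are its outgoing edges."""
    return graph[node]


def step(action: Tuple[str, str]) -> str:
    """Deterministic, known transition: following an edge lands on its tail node."""
    return action[1]


def walk(graph: Graph, source: str, chosen: List[int], target: str):
    """Follow a list of edge indices from `source`.

    The reward is sparse: 1.0 only if the walk ends on `target`, else 0.0.
    The returned path is the history that a state encoder would consume.
    """
    path = [source]
    node = source
    for idx in chosen:
        node = step(actions(graph, node)[idx])
        path.append(node)
    reward = 1.0 if node == target else 0.0
    return path, reward
```

Because the transition function is known, a search procedure can simulate many such walks without interacting with an external environment, which is what makes combining a learned policy with tree search practical in this setting.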
Proposed Method
- Introduce a graph-walking agent M-Walk that encodes the full history of traversed nodes and the query into a state representation via a GRU-based RNN encoder.
- Jointly model policy and Q-value using shared parameters, with a neural architecture that computes action scores through inner products of state and action representations.
- Use MCTS with a PUCT-like selection to generate informative trajectories from a prior policy, exploiting the deterministic, known transition model of the graph.
- Update the Q-network with off-policy Q-learning using trajectories produced by MCTS, which indirectly improves the policy due to parameter sharing.
- At test time, combine MCTS with the learned policy and Q-function to score candidate target nodes and select the highest-scoring node.
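The pieces listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `puct_select` uses a generic PUCT-style rule whose exact form and the constant `c` may differ from the paper, `q_learning_target` is the textbook one-step Q-learning target rather than the paper's specific loss, and every function name here is hypothetical.

```python
import math
from typing import List


def inner_product_scores(state_vec: List[float],
                         action_vecs: List[List[float]]) -> List[float]:
    """Score each candidate action as <state, action> (shared representation)."""
    return [sum(s * a for s, a in zip(state_vec, av)) for av in action_vecs]


def softmax(scores: List[float]) -> List[float]:
    """Turn scores into a prior policy over actions (numerically stable)."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def puct_select(prior: List[float], visit_counts: List[int],
                total_values: List[float], c: float = 1.0) -> int:
    """Generic PUCT-style selection (assumed form): mean value plus a
    prior-weighted exploration bonus that shrinks as an action is visited."""
    total_visits = sum(visit_counts)
    best_action, best_score = 0, -math.inf
    for a in range(len(prior)):
        mean_value = total_values[a] / visit_counts[a] if visit_counts[a] > 0 else 0.0
        bonus = c * prior[a] * math.sqrt(total_visits) / (1 + visit_counts[a])
        score = mean_value + bonus
        if score > best_score:
            best_action, best_score = a, score
    return best_action


def q_learning_target(reward: float, next_q_values: List[float],
                      gamma: float = 1.0, terminal: bool = False) -> float:
    """Textbook one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)
```

In this sketch the same inner-product scores could feed both the prior used by the search and the Q-values being regressed toward the target, which is one way to picture how a Q-learning update can also move the policy when parameters are shared.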
Experimental Results
Research Questions
- RQ1: Can an RNN-encoded history plus MCTS help learn effective walks on graphs with sparse rewards in KBC tasks?
- RQ2: Does sharing parameters between the Q-network and policy network enable effective off-policy policy improvement from MCTS-generated trajectories?
- RQ3: How does M-Walk compare to policy-gradient RL methods and traditional KBC baselines on benchmarks like NELL995 and WN18RR?
- RQ4: What is the impact of MCTS components (rollouts, horizon) on training efficiency, trajectory quality, and overall performance?
Key Findings
- M-Walk learns better policies than prior RL-based methods and traditional KBC baselines on several benchmarks.
- MCTS-enabled trajectories yield more positive rewards than the neural policy alone, aiding learning in sparse-reward settings.
- The shared-parameter architecture allows off-policy Q-learning updates to improve the policy, with test-time MCTS using the improved policy.
- On NELL995 and WN18RR, M-Walk outperforms several RL-based baselines and embedding-based methods on multiple metrics.
- Ablations show M-Walk’s neural architecture provides gains over MINERVA, and MCTS contributes additional improvements beyond a purely policy-gradient approach.
This walkthrough was generated by AI and reviewed by a human editor.