[Paper Review] M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search
M-Walk combines a recurrent neural network with Monte Carlo Tree Search to learn graph-walking policies for knowledge base completion. MCTS generates trajectories with more positive rewards to counter reward sparsity, and off-policy Q-learning on those trajectories improves the policy through shared parameters.
Learning to walk over a graph towards a target node for a given query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem with a known state transition model. To overcome the challenge of sparse rewards, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (i.e., history of the walked path) and maps it separately to a policy and Q-values. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories yielding more positive rewards. From these trajectories, the network is improved in an off-policy manner using Q-learning, which modifies the RNN policy via parameter sharing. Our proposed RL algorithm repeatedly applies this policy-improvement step to learn the model. At test time, MCTS is combined with the neural policy to predict the target node. Experimental results on several graph-walking benchmarks show that M-Walk is able to learn better policies than other RL-based methods, which are mainly based on policy gradients. M-Walk also outperforms traditional KBC baselines.
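To make the RL formulation concrete, here is a minimal sketch of graph walking as a deterministic environment with a sparse terminal reward. The `STOP` action name, the toy graph, the helper names, and the 1.0/0.0 reward values are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: names, toy data, and reward values are assumptions.
STOP = "STOP"  # hypothetical action: end the walk and predict the current node


def actions(graph, node):
    """Legal actions at `node`: every outgoing (relation, next_node) edge, plus STOP."""
    return list(graph.get(node, [])) + [STOP]


def step(graph, node, action, target):
    """Deterministic, known transition. Returns (next_node, reward, done).

    The reward is sparse: it is nonzero only when the walk stops on the target.
    """
    if action == STOP:
        return node, (1.0 if node == target else 0.0), True
    _relation, next_node = action
    return next_node, 0.0, False


def rollout(graph, source, target, plan):
    """Follow a fixed list of actions and return (final_node, total_reward)."""
    node, total = source, 0.0
    for action in plan:
        if action not in actions(graph, node):
            raise ValueError(f"illegal action {action!r} at node {node!r}")
        node, reward, done = step(graph, node, action, target)
        total += reward
        if done:
            break
    return node, total


# Toy knowledge graph: node -> list of (relation, next_node) edges.
toy_graph = {
    "alice": [("works_at", "acme"), ("lives_in", "paris")],
    "acme": [("located_in", "berlin")],
}

good_walk = [("works_at", "acme"), ("located_in", "berlin"), STOP]
bad_walk = [("lives_in", "paris"), STOP]
print(rollout(toy_graph, "alice", "berlin", good_walk))
print(rollout(toy_graph, "alice", "berlin", bad_walk))
```

Because `step` is known and deterministic, a search procedure can simulate walks without interacting with any external environment, which is the property MCTS relies on here.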
Motivation & Objective
- Motivate learning to walk over graphs to identify target nodes given a source and a query, with applications to knowledge base completion (KBC).
- Address sparse rewards and history-dependent states by using an RNN encoder combined with Monte Carlo Tree Search (MCTS).
- Learn a policy and Q-function with shared parameters to enable off-policy policy improvement via Q-learning.
- Leverage known deterministic graph transitions to integrate model-based search (MCTS) with neural learning for improved trajectory generation.
- Evaluate M-Walk against RL baselines and traditional KBC methods on synthetic and real-world benchmarks.
Proposed method
- Introduce a graph-walking agent M-Walk that encodes the full history of traversed nodes and the query into a state representation via a GRU-based RNN encoder.
- Jointly model policy and Q-value using shared parameters, with a neural architecture that computes action scores through inner products of state and action representations.
- Use MCTS with a PUCT-like selection to generate informative trajectories from a prior policy, exploiting the deterministic, known transition model of the graph.
- Update the Q-network with off-policy Q-learning using trajectories produced by MCTS, which indirectly improves the policy due to parameter sharing.
- At test time, combine MCTS with the learned policy and Q-function to score candidate target nodes and select the highest-scoring node.
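The pieces listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: the sigmoid and softmax heads over shared inner-product scores are one plausible reading of "shared parameters", the selection rule is the generic AlphaGo-style PUCT form rather than the paper's exact variant, and all function names, the constant `c`, and the discount `gamma` are hypothetical.

```python
import math


def action_scores(state_vec, action_vecs):
    """Shared scores: inner product of the state vector with each action vector."""
    return [sum(s * a for s, a in zip(state_vec, vec)) for vec in action_vecs]


def q_values(scores):
    """Assumed Q head: element-wise sigmoid of the shared scores."""
    return [1.0 / (1.0 + math.exp(-u)) for u in scores]


def policy(scores, temperature=1.0):
    """Assumed policy head: temperature softmax over the same shared scores."""
    top = max(scores)
    exps = [math.exp((u - top) / temperature) for u in scores]
    total = sum(exps)
    return [e / total for e in exps]


def puct_select(prior, visit_counts, total_values, c=1.0):
    """Generic PUCT: pick argmax of mean value plus a prior-weighted exploration bonus."""
    total_visits = sum(visit_counts)
    best_action, best_score = 0, -math.inf
    for a, (p, n, w) in enumerate(zip(prior, visit_counts, total_values)):
        mean_value = w / n if n > 0 else 0.0
        bonus = c * p * math.sqrt(total_visits) / (1 + n)
        if mean_value + bonus > best_score:
            best_action, best_score = a, mean_value + bonus
    return best_action


def q_learning_target(reward, next_q, gamma=1.0, done=False):
    """Standard one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done or not next_q:
        return reward
    return reward + gamma * max(next_q)


scores = action_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
prior = policy(scores)
# An unvisited action receives a large exploration bonus and is chosen here.
chosen = puct_select(prior=[0.5, 0.5], visit_counts=[10, 0], total_values=[5.0, 0.0])
print(scores, prior, chosen)
```

Since `q_values` and `policy` read the same scores, a gradient step that moves a Q-value toward `q_learning_target` also shifts the softmax policy, which is the sense in which Q-learning "indirectly improves the policy" in this sketch.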
Experimental results
Research questions
- RQ1: Can an RNN-encoded history plus MCTS help learn effective walks on graphs with sparse rewards in KBC tasks?
- RQ2: Does sharing parameters between the Q-network and policy network enable effective off-policy policy improvement from MCTS-generated trajectories?
- RQ3: How does M-Walk compare to policy-gradient RL methods and traditional KBC baselines on benchmarks like NELL995 and WN18RR?
- RQ4: What is the impact of MCTS components (rollouts, horizon) on training efficiency, trajectory quality, and overall performance?
Key findings
- M-Walk learns better policies than prior RL-based methods, which mainly rely on policy gradients, and it also outperforms traditional KBC baselines on several benchmarks.
- MCTS-enabled trajectories yield more positive rewards than the neural policy alone, aiding learning in sparse-reward settings.
- The shared-parameter architecture allows off-policy Q-learning updates to improve the policy, with test-time MCTS using the improved policy.
- On NELL995 and WN18RR, M-Walk outperforms several RL-based baselines and embedding-based methods on multiple metrics.
- Ablations show M-Walk’s neural architecture provides gains over MINERVA, and MCTS contributes additional improvements beyond a purely policy-gradient approach.
This review was created by AI and reviewed by human editors.