QUICK REVIEW

[Paper Review] Combining Q-Learning and Search with Amortized Value Estimates

Jessica B. Hamrick, Victor Bapst|arXiv (Cornell University)|Apr 30, 2020

Reinforcement Learning in Robotics|46 references|17 citations

TL;DR

SAVE combines Q-learning with Monte-Carlo Tree Search by using a learned prior over state-action values to guide search, which produces improved Q-estimates that are then used to update the prior. This amortizes MCTS computation, enabling faster learning and superior performance with minimal search budgets.

ABSTRACT

We introduce "Search with Amortized Value Estimates" (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and Atari. SAVE consistently achieves higher rewards with fewer training steps, and, in contrast to typical model-based search approaches, yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.

Motivation & Objective

To reduce the computational cost of model-based planning in reinforcement learning while maintaining high sample efficiency.
To improve sample efficiency and learning speed in deep reinforcement learning by combining model-free Q-learning with model-based search.
To enable strong performance with very small search budgets, overcoming a key limitation of typical model-based approaches.
To create a cooperative learning loop between model-free updates and model-based search through amortized value estimation.

Proposed method

A learned prior over state-action values is used to guide Monte-Carlo Tree Search (MCTS), improving search efficiency.
MCTS computes improved state-action value estimates based on the prior and environment dynamics.
The improved Q-estimates from MCTS are combined with real experience to update the prior network via Q-learning.
The process creates a feedback loop where search enhances learning and learning improves search guidance.
The method is modular and can be integrated into any Q-learning agent with access to a model.
Value estimates computed by search are amortized into the learned prior, so later searches start from better estimates and can run with smaller budgets.
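
The loop described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the chain environment, the function names (model, search, save_update, train), and all hyperparameters are hypothetical. The search is a root-level UCT with short rollouts that stands in for full MCTS, and the choices to start search values at the prior with a visit count of one and to amortize through a cross-entropy term between softmaxed Q-values are assumptions of this sketch that the review itself does not confirm.

```python
import math
import random

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9  # toy chain; all values are illustrative


def model(state, action):
    """Hypothetical known model: action 1 moves right, action 0 moves left.
    Entering the last state yields reward 1 and ends the episode."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def search(q_prior, root, budget, rng, depth=3, c=1.0):
    """Root-level UCT with short rollouts: a simplified stand-in for MCTS."""
    q = list(q_prior[root])  # assumption: search estimates start at the prior
    n = [1] * N_ACTIONS      # assumption: the prior counts as one visit
    for _ in range(budget):
        total = sum(n)
        a = max(range(N_ACTIONS),
                key=lambda i: q[i] + c * math.sqrt(math.log(total) / n[i]))
        s, act, ret, disc, done = root, a, 0.0, 1.0, False
        for _ in range(depth):
            s, r, done = model(s, act)
            ret += disc * r
            disc *= GAMMA
            if done:
                break
            # Rollout policy samples from a softmax over the prior's Q-values.
            act = rng.choices(range(N_ACTIONS), weights=softmax(q_prior[s]))[0]
        if not done:
            ret += disc * max(q_prior[s])  # bootstrap from the prior at the cutoff
        n[a] += 1
        q[a] += (ret - q[a]) / n[a]        # running-mean backup
    return q


def save_update(q_prior, s, a, r, s_next, done, q_search,
                lr=0.1, beta_q=1.0, beta_a=1.0):
    """One gradient step on a TD loss (real experience) plus a cross-entropy
    loss pulling softmax(q_prior[s]) toward softmax(q_search)."""
    target = r if done else r + GAMMA * max(q_prior[s_next])
    td_error = q_prior[s][a] - target  # gradient of 0.5 * (Q - target)^2
    p_prior, p_search = softmax(q_prior[s]), softmax(q_search)
    for i in range(N_ACTIONS):
        grad = beta_a * (p_prior[i] - p_search[i])  # cross-entropy gradient
        if i == a:
            grad += beta_q * td_error
        q_prior[s][i] -= lr * grad


def train(episodes=200, budget=4, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q_prior = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done, steps = 0, False, 0
        while not done and steps < 20:
            q_search = search(q_prior, s, budget, rng)  # prior guides search
            if rng.random() < epsilon:
                a = rng.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda i: q_search[i])
            s_next, r, done = model(s, a)
            save_update(q_prior, s, a, r, s_next, done, q_search)  # search improves prior
            s, steps = s_next, steps + 1
    return q_prior
```

Calling train() returns the tabular prior after training. Because each search starts from that prior, a better prior lets a small budget go further, which is the feedback loop the list describes. A neural Q-network and a full tree search would replace the table and the rollout search in a realistic agent.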

Experimental results

Research questions

RQ1: Can combining model-free Q-learning with model-based search improve sample efficiency in reinforcement learning?
RQ2: How can MCTS computation be amortized to reduce planning cost without sacrificing performance?
RQ3: Can strong performance be achieved with very small search budgets using a learned prior to guide search?
RQ4: Does the cooperative loop between search and learning lead to faster convergence and higher final returns?

Key findings

SAVE achieves higher cumulative rewards compared to baseline Q-learning agents across both physical reasoning tasks and Atari environments.
The method learns faster than the baseline agents, requiring fewer training steps to reach peak performance.
SAVE maintains strong performance even with very small search budgets, outperforming standard model-based approaches under such constraints.
The integration of search-derived value estimates with real experience leads to more accurate and stable Q-value estimation.

This review was created by AI and reviewed by human editors.