QUICK REVIEW

[Paper Review] Neural Combinatorial Optimization with Reinforcement Learning

Irwan Bello, Hieu Pham|arXiv (Cornell University)|Nov 29, 2016

Metaheuristic Optimization Algorithms Research|278 citations

TL;DR

The paper presents Neural Combinatorial Optimization, which trains a pointer-network-based policy with reinforcement learning (policy gradients) to solve TSP on 2D Euclidean graphs and the knapsack problem. RL pretraining combined with active search achieves near-optimal results. The paper also shows that RL-based training outperforms supervised-learning approaches, and it introduces inference-time search variants that improve solution quality.

ABSTRACT

This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Despite the computational expense, without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items.

Motivation & Objective

Motivate a learning-based approach to combinatorial optimization that generalizes across problem sizes.
Develop a neural architecture that can output valid permutations without ground-truth labels.
Demonstrate effectiveness on 2D Euclidean TSP and knapsack, and compare to classical solvers.
Explore training strategies (RL pretraining and active search) to improve solution quality.

Proposed method

Use a pointer network with encoder-decoder LSTMs and an attention-based pointing mechanism to model p(pi|s).
Factorize the tour probability with the chain rule, p(pi|s) = prod_{i=1..n} p(pi(i) | pi(<i), s), and represent each factor with a non-parametric softmax module (the pointer mechanism).
Train with policy gradients (REINFORCE) to minimize expected tour length using a baseline to reduce variance.
Introduce a critic (baseline network) to estimate expected tour length for a given input and guide learning (actor-critic).
Employ two inference-time search strategies: sampling from the stochastic policy and an active search procedure that updates policy parameters on a single test instance.
Discuss generalization to other problems and illustrate with knapsack as a case study.
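
The policy-gradient training described above can be illustrated with a minimal sketch. It is not the paper's implementation: the pointer network is replaced by a hypothetical one-parameter toy policy (`theta` scales a negative-distance logit), the learned critic is swapped for an exponential moving average baseline, and all function names, batch sizes, and learning rates are assumptions chosen for illustration. What it does show is the chain-rule sampling over unvisited cities and the REINFORCE update, where each sampled tour's log-probability gradient is weighted by (tour length minus baseline).

```python
import math
import random

def tour_length(cities, tour):
    """Length of the closed tour that visits `cities` in the order `tour`."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))

def sample_tour(cities, theta, rng):
    """Sample a tour one city at a time; return (tour, d log p(tour)/d theta).

    Hypothetical toy policy standing in for the pointer network: an unvisited
    city j gets logit -theta * distance(current city, j). Visited cities are
    left out of the softmax, so every sample is a valid permutation.
    """
    n = len(cities)
    tour = [0]                      # fixed start; a closed tour can start anywhere
    grad_logp = 0.0
    while len(tour) < n:
        here = cities[tour[-1]]
        cand = [j for j in range(n) if j not in tour]
        dists = [math.dist(here, cities[j]) for j in cand]
        logits = [-theta * d for d in dists]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        k = rng.choices(range(len(cand)), weights=probs)[0]
        # For logits -theta * d_j: d/dtheta log p_k = -d_k + sum_j p_j * d_j
        grad_logp += -dists[k] + sum(p * d for p, d in zip(probs, dists))
        tour.append(cand[k])
    return tour, grad_logp

def reinforce_step(cities, theta, baseline, rng, batch=16, lr=0.05, decay=0.9):
    """One REINFORCE update of theta on a single instance."""
    samples = [sample_tour(cities, theta, rng) for _ in range(batch)]
    lengths = [tour_length(cities, tour) for tour, _ in samples]
    # Monte Carlo estimate of d E[length] / d theta, centred by the baseline.
    grad = sum((length - baseline) * g
               for length, (_, g) in zip(lengths, samples)) / batch
    theta -= lr * grad              # gradient descent on expected tour length
    mean_length = sum(lengths) / batch
    # Assumption: moving-average baseline instead of a learned critic.
    baseline = decay * baseline + (1 - decay) * mean_length
    return theta, baseline, mean_length

rng = random.Random(0)
cities = [(rng.random(), rng.random()) for _ in range(10)]
theta, baseline = 0.0, tour_length(cities, list(range(10)))
for _ in range(200):
    theta, baseline, mean_length = reinforce_step(cities, theta, baseline, rng)
print(f"theta={theta:.2f}  mean sampled tour length={mean_length:.3f}")
```

Because the loop updates the parameter on one fixed instance, it is closer in spirit to the active search setting than to pretraining over a distribution of graphs; pretraining would draw a fresh set of cities for each batch.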

Experimental results

Research questions

RQ1: Can a neural network with a pointer architecture learn good heuristics for combinatorial optimization without supervised optimal labels?
RQ2: Does reinforcement learning with pretraining plus active search outperform supervised learning baselines on TSP and knapsack?
RQ3: What are effective inference-time strategies to close the gap to optimal solutions?
RQ4: How well does the approach generalize to variable problem sizes beyond the training instance size?
RQ5: Can the framework be adapted to other combinatorial tasks by altering the reward and feasibility handling?

Key findings

On TSP, RL-based training substantially improves over the supervised-learning approach used in prior work.
The method achieves close-to-optimal results on 2D Euclidean TSP graphs up to 100 nodes given sufficient compute.
Applied to knapsack, the approach attains optimal solutions for instances with up to 200 items.
Active Search and RL pretraining-Sampling are the most competitive inference strategies, with trade-offs between speed and solution quality.
Greedy decoding is fast but inferior; sampling and active search can yield near-optimal tours with additional computation.
Inference-time search can be stopped early with small losses in quality for faster runtimes.
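
The trade-off between greedy decoding, sampling, and early stopping noted in the findings above can be sketched as follows. This is an illustrative toy rather than the paper's code: `decode` uses a hypothetical distance-based policy in place of the trained pointer network, `theta=5.0` and the budgets are arbitrary, and seeding the search with the greedy tour is a choice of this sketch so that sampling can never do worse than greedy.

```python
import math
import random

def tour_length(cities, tour):
    """Length of the closed tour that visits `cities` in the order `tour`."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))

def decode(cities, theta, rng=None):
    """Build one tour from a hypothetical toy policy (not the paper's network).

    An unvisited city j gets logit -theta * distance(current city, j).
    rng=None picks the top-scoring city at each step (greedy); otherwise the
    next city is sampled from the softmax over unvisited cities.
    """
    n = len(cities)
    tour = [0]
    while len(tour) < n:
        here = cities[tour[-1]]
        cand = [j for j in range(n) if j not in tour]
        logits = [-theta * math.dist(here, cities[j]) for j in cand]
        if rng is None:
            k = max(range(len(cand)), key=logits.__getitem__)
        else:
            top = max(logits)
            weights = [math.exp(x - top) for x in logits]
            k = rng.choices(range(len(cand)), weights=weights)[0]
        tour.append(cand[k])
    return tour

def sample_search(cities, theta, budget, seed=0):
    """Keep the shortest tour among the greedy tour and `budget` samples."""
    rng = random.Random(seed)
    best = decode(cities, theta)
    best_len = tour_length(cities, best)
    for _ in range(budget):         # a smaller budget is an earlier stop
        tour = decode(cities, theta, rng)
        length = tour_length(cities, tour)
        if length < best_len:
            best, best_len = tour, length
    return best, best_len

rng = random.Random(0)
cities = [(rng.random(), rng.random()) for _ in range(12)]
greedy_len = tour_length(cities, decode(cities, 5.0))
for budget in (10, 100, 1000):
    _, length = sample_search(cities, 5.0, budget)
    print(budget, round(length, 3), "greedy:", round(greedy_len, 3))
```

With a fixed seed, the best length can only stay the same or improve as the budget grows, since a larger budget extends the same sequence of samples. Cutting the budget therefore trades some solution quality for runtime, which is the early-stopping behaviour described in the last finding.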

This review was created by AI and reviewed by human editors.