QUICK REVIEW

[Paper Review] Combinatorial Optimization by Graph Pointer Networks and Hierarchical Reinforcement Learning

Qiang Ma, Suwen Ge|arXiv (Cornell University)|Nov 12, 2019

Reinforcement Learning in Robotics|References: 26|Citations: 139

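One-Line Summary

The paper introduces Graph Pointer Networks (GPNs), which add a graph embedding to Pointer Networks, and applies them to the TSP. It also proposes a hierarchical reinforcement learning framework (HGPN) for constrained problems such as the TSP with Time Windows. The results show generalization to larger instances and competitive feasibility.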

ABSTRACT

In this work, we introduce Graph Pointer Networks (GPNs) trained using reinforcement learning (RL) for tackling the traveling salesman problem (TSP). GPNs build upon Pointer Networks by introducing a graph embedding layer on the input, which captures relationships between nodes. Furthermore, to approximate solutions to constrained combinatorial optimization problems such as the TSP with time windows, we train hierarchical GPNs (HGPNs) using RL, which learns a hierarchical policy to find an optimal city permutation under constraints. Each layer of the hierarchy is designed with a separate reward function, resulting in stable training. Our results demonstrate that GPNs trained on small-scale TSP50/100 problems generalize well to larger-scale TSP500/1000 problems, with shorter tour lengths and faster computational times. We verify that for constrained TSP problems such as the TSP with time windows, the feasible solutions found via hierarchical RL training outperform previous baselines. In the spirit of reproducible research we make our data, models, and code publicly available.

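Motivation and Objectives

Motivates solving the traveling salesman problem (TSP) and its constrained variants with learning-based methods.
Proposes Graph Pointer Networks (GPNs), which incorporate graph embeddings to better capture relationships between cities.
Introduces hierarchical reinforcement learning (HGPN) to handle constraints such as time windows.
Demonstrates generalization from small to large instances and evaluates on the TSP with Time Windows (TSPTW).
Provides reproducible code and data to support benchmarking and further research.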

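Proposed Method

Develops Graph Pointer Networks (GPNs) with a point encoder and a graph embedding layer that capture relationships between cities.
Uses a vector context (differences between city coordinates) to improve applicability to larger TSPs.
Trains GPNs by reinforcement learning with policy gradient and a central self-critic baseline.
Introduces a two-layer hierarchical GPN (HGPN) for constrained problems such as TSPTW, decomposing the task to stabilize training.
Uses layer-wise policy optimization: the lower layer is trained to enforce the feasibility constraints, and the upper layer optimizes the objective.
Provides a two-layer HGPN architecture in which feedback from the lower layer biases the upper layer's decisions through latent variables.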
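
The vector context mentioned above can be illustrated with a minimal sketch. It assumes the context is each city's coordinates minus the coordinates of the city currently being visited; the function name `vector_context` and the plain-tuple representation are hypothetical choices for illustration, not taken from the authors' code.

```python
def vector_context(coords, current):
    """Return every city's position relative to the current city.

    coords:  list of (x, y) tuples, one per city
    current: index of the city the agent is currently at
    Assumption: the context is "city minus current city".
    """
    cx, cy = coords[current]
    return [(x - cx, y - cy) for x, y in coords]


cities = [(0.2, 0.5), (0.7, 0.1), (0.4, 0.9)]
context = vector_context(cities, current=1)

# The current city always maps to the origin.
print(context[1])
```

One property of this sketch is that shifting every city by the same offset leaves the context unchanged, so the input depends only on relative positions rather than absolute coordinates.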

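Experimental Results

Research Questions

RQ1: Can Graph Pointer Networks generalize from small to large TSP instances?
RQ2: Does incorporating graph embeddings and the vector context improve performance over earlier pointer-based models?
RQ3: Can hierarchical RL with per-layer rewards effectively solve constrained TSP variants such as TSPTW?
RQ4: How does HGPN compare with classical solvers and other ML-based methods on large-scale TSP and constrained variants?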

Key Findings

Method	Tour Len (TSP250)	Time (TSP250)	Tour Len (TSP500)	Time (TSP500)	Tour Len (TSP750)	Time (TSP750)	Tour Len (TSP1000)	Time (TSP1000)
LKH	11.893	9792s	16.542	23070s	20.129	36840s	23.130	50680s
Concorde	11.89	1894s	16.55	13902s	20.10	32993s	23.11	47804s
Nearest Neighbor	14.928	25s	20.791	60s	25.219	115s	28.973	136s
2-opt	13.253	303s	18.600	1363s	22.668	3296s	26.111	6153s
Farthest Insertion	13.026	33s	18.288	160s	22.342	454s	25.741	945s
OR-Tools (Savings)	12.652	5000s	17.653	5000s	22.933	5000s	28.332	5000s
OR-Tools (Christofides)	12.289	5000s	17.449	5000s	22.395	5000s	26.477	5000s
s2v-DQN	13.079	476s	18.428	1508s	22.550	3182s	26.046	5600s
Pointer Net	14.249	29s	21.409	280s	27.382	782s	32.714	3133s
Attention Model	14.032	2s	24.789	14s	28.281	42s	34.055	136s
GPN (ours)	13.679	32s	19.605	111s	24.337	232s	28.471	393s
GPN+2opt (ours)	12.942	214s	18.358	974s	22.541	2278s	26.129	4410s
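
To make the table easier to compare, the tour lengths can be converted into relative gaps. The sketch below copies the TSP1000 tour lengths from the table and, as an assumption, treats the Concorde value as the reference length; the helper name `gap_percent` is hypothetical.

```python
# TSP1000 tour lengths copied from the table above.
CONCORDE_TSP1000 = 23.11  # assumed reference (exact solver)
tour_lengths = {
    "Pointer Net": 32.714,
    "Attention Model": 34.055,
    "GPN": 28.471,
    "GPN+2opt": 26.129,
}


def gap_percent(length, reference):
    """Relative excess of a tour length over the reference, in percent."""
    return 100.0 * (length - reference) / reference


gaps = {name: gap_percent(length, CONCORDE_TSP1000)
        for name, length in tour_lengths.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f}%")
```

Under this assumption, GPN+2opt is roughly 13% above the Concorde tour length on TSP1000, while the Attention Model is roughly 47% above it.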

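GPNs with graph embeddings generalize from small TSPs (e.g., TSP50) to large instances (up to TSP1000), with competitive tour lengths and fast run times (28.471 in 393s on TSP1000, versus 32.714 in 3133s for Pointer Net).
On large TSPs, GPNs that use the vector context generalize better than those that use the point context.
HGPN outperforms the baselines on the TSP with Time Windows, with higher feasibility and lower cost in several settings.
On the large-scale TSP benchmarks, GPN with 2-opt refinement (GPN+2opt) outperforms some OR-Tools configurations (on TSP1000, 26.129 versus 26.477 for Christofides and 28.332 for Savings) and approaches the state of the art in certain settings.
In real-world TSPLIB evaluations, GPN+2opt achieves competitive gaps while running substantially faster than some exact solvers.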
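
The 2-opt refinement behind GPN+2opt is a standard local search, and a minimal sketch may help show what the post-processing step does. This is a textbook version under simple assumptions (Euclidean distances, improvements accepted as soon as they are found), not the authors' implementation; the names `tour_length` and `two_opt` are hypothetical.

```python
import math


def tour_length(coords, tour):
    """Length of the closed tour that visits coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
               for i in range(n))


def two_opt(coords, tour):
    """Reverse tour segments for as long as a reversal shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city; nothing to swap
                a, b = coords[tour[i]], coords[tour[i + 1]]
                c, d = coords[tour[j]], coords[tour[(j + 1) % n]]
                # Change in length if edges (a,b),(c,d) become (a,c),(b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour


# Unit square visited in a self-crossing order.
square = [(0, 0), (1, 1), (1, 0), (0, 1)]
refined = two_opt(square, [0, 1, 2, 3])
print(tour_length(square, [0, 1, 2, 3]), tour_length(square, refined))
```

In the GPN+2opt setting, the initial tour would come from the trained network rather than from a fixed order as in this toy example. The extra passes over the tour are consistent with the longer run times the table reports for GPN+2opt compared with GPN alone.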

This review was generated by AI and checked by a human editor.