QUICK REVIEW

[Paper Review] Combinatorial Optimization by Graph Pointer Networks and Hierarchical Reinforcement Learning

Qiang Ma, Suwen Ge|arXiv (Cornell University)|Nov 12, 2019

Reinforcement Learning in Robotics|References: 26|Cited by: 139

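One-Sentence Summary

Introduces Graph Pointer Networks (GPNs), which add a graph embedding for the TSP, together with a hierarchical RL framework (HGPN) for constrained problems such as the TSP with time windows; the models generalize to large-scale instances and are competitive on feasibility.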

ABSTRACT

In this work, we introduce Graph Pointer Networks (GPNs) trained using reinforcement learning (RL) for tackling the traveling salesman problem (TSP). GPNs build upon Pointer Networks by introducing a graph embedding layer on the input, which captures relationships between nodes. Furthermore, to approximate solutions to constrained combinatorial optimization problems such as the TSP with time windows, we train hierarchical GPNs (HGPNs) using RL, which learns a hierarchical policy to find an optimal city permutation under constraints. Each layer of the hierarchy is designed with a separate reward function, resulting in stable training. Our results demonstrate that GPNs trained on small-scale TSP50/100 problems generalize well to larger-scale TSP500/1000 problems, with shorter tour lengths and faster computational times. We verify that for constrained TSP problems such as the TSP with time windows, the feasible solutions found via hierarchical RL training outperform previous baselines. In the spirit of reproducible research we make our data, models, and code publicly available.

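Research Motivation and Goals

Advance learning-based methods for solving the Traveling Salesman Problem (TSP) and its constrained variants.
Propose Graph Pointer Networks (GPNs), which use a graph embedding to better capture relationships between cities.
Introduce hierarchical reinforcement learning (HGPN) to handle constraints such as time windows.
Demonstrate generalization from small to large-scale TSP instances, and evaluate on the TSP with time windows (TSPTW).
Release reproducible code and data to support benchmarking and further research.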

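Proposed Method

Develop Graph Pointer Networks (GPNs), consisting of a point encoder and a graph embedding layer that captures relationships between cities.
Use a vector context (the differences between city coordinates) to improve transferability to larger TSP instances.
Train GPNs with policy gradient and a central self-critic baseline.
Introduce a two-layer hierarchical GPN (HGPN) for constrained problems such as TSPTW, decomposing the task and stabilizing training.
Train the lower layer to enforce feasibility constraints and the upper layer to optimize the objective, using hierarchical policy optimization.
Provide a two-layer HGPN architecture in which feedback from the lower layer biases upper-layer decisions through latent variables.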
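Two of the ideas above, the vector context and policy-gradient training against a baseline, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`vector_context`, `tour_length`, `policy_gradient_surrogate`) are hypothetical, the vector context is assumed to be each city's coordinate offset from the current city, and the baseline is simply passed in as a number per instance rather than computed the way the paper's central self-critic does.

```python
import math


def vector_context(coords, current):
    """Offsets from the current city to every city (assumed form of the vector context)."""
    cx, cy = coords[current]
    return [(x - cx, y - cy) for x, y in coords]


def tour_length(coords, tour):
    """Length of the closed tour that visits the cities in the given order."""
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
        for i in range(n)
    )


def policy_gradient_surrogate(sampled_lengths, baseline_lengths, log_probs):
    """REINFORCE-style surrogate: mean of (L_sampled - L_baseline) * log p(tour).

    The baseline values are placeholders supplied by the caller; how the
    paper's central self-critic computes them is not reproduced here.
    """
    terms = [
        (ls - lb) * lp
        for ls, lb, lp in zip(sampled_lengths, baseline_lengths, log_probs)
    ]
    return sum(terms) / len(terms)


coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
context = vector_context(coords, current=1)
shifted = [(x + 5.0, y - 3.0) for x, y in coords]
same_after_shift = vector_context(shifted, current=1) == context
surrogate = policy_gradient_surrogate([4.0, 5.0], [4.5, 4.5], [-1.0, -2.0])
```

Because the offsets do not change when every city is shifted by the same amount, a model fed this context sees the same input regardless of where the instance sits in the plane, which is one plausible reason such a context would transfer across instance sizes. Minimizing the surrogate with respect to the policy parameters raises the probability of tours shorter than the baseline and lowers it for longer ones.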

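Experimental Results

Research Questions

RQ1: Can Graph Pointer Networks generalize from small-scale TSP instances to large-scale ones?
RQ2: Do graph embeddings and vector context improve on earlier pointer-based models?
RQ3: Can hierarchical RL with layer-specific rewards effectively solve constrained TSP variants such as TSPTW?
RQ4: How does HGPN compare with classical solvers and other ML-based methods on large-scale TSP and constrained variants?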
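Answering RQ3 requires a definition of feasibility for TSPTW. The sketch below uses a common formulation, which is an assumption here rather than something taken from the paper: the tour starts at its first city at time 0, arriving before a window opens means waiting, and arriving after it closes counts as a violation. The function name `tsptw_schedule`, the `speed` parameter, and the choice not to return to the start are all illustrative.

```python
import math


def tsptw_schedule(coords, windows, tour, speed=1.0):
    """Simulate a tour and return (finish_time, number_of_window_violations).

    Assumptions: departure from tour[0] at time 0, the start city's own
    window is not checked, and the tour does not return to the start.
    """
    time = 0.0
    violations = 0
    for prev, city in zip(tour, tour[1:]):
        time += math.dist(coords[prev], coords[city]) / speed
        earliest, latest = windows[city]
        if time < earliest:
            time = earliest  # arrived early: wait until the window opens
        elif time > latest:
            violations += 1  # arrived late: this visit is infeasible
    return time, violations


coords = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
windows = [(0.0, 100.0), (6.0, 10.0), (0.0, 8.0)]
late_result = tsptw_schedule(coords, windows, [0, 1, 2])
feasible_result = tsptw_schedule(coords, windows, [0, 2, 1])
```

In this toy instance the order 0, 1, 2 forces a wait at city 1 and then reaches city 2 after its window has closed, while the order 0, 2, 1 satisfies every window. A tour counts as feasible when the violation count is zero.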

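Key Findings

With graph embeddings, GPNs generalize from small TSP instances (e.g., TSP50) to larger ones (up to TSP1000), with competitive tour lengths and running times.
GPNs with vector context outperform point-context-only models on larger TSP instances, improving generalization.
HGPN outperforms the baselines on the TSP with time windows, achieving higher feasibility and lower cost in several settings.
On large-scale TSP benchmarks, GPN with 2-opt refinement (GPN+2opt) beats some OR-Tools configurations and approaches the state of the art in certain settings.
On real-world TSPLIB instances, GPN+2opt reaches competitive optimality gaps while running significantly faster than some exact solvers.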
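The 2-opt refinement and the optimality gap mentioned in these findings can be sketched as follows. This is a textbook 2-opt local search on a closed tour, written for clarity rather than speed; it is not claimed to match the variant, stopping rule, or implementation used in the paper. The names `two_opt` and `optimality_gap` are hypothetical.

```python
import math


def tour_length(coords, tour):
    """Length of the closed tour that visits the cities in the given order."""
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
        for i in range(n)
    )


def two_opt(coords, tour):
    """Reverse tour segments while doing so shortens the tour."""
    best = list(tour)
    best_len = tour_length(coords, best)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                # Reversing best[i..j] swaps two edges of the tour for two others.
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = tour_length(coords, candidate)
                if cand_len < best_len - 1e-12:
                    best, best_len = candidate, cand_len
                    improved = True
    return best, best_len


def optimality_gap(found_length, reference_length):
    """Relative excess of a found tour over a reference (e.g., optimal) tour."""
    return (found_length - reference_length) / reference_length


# A self-crossing tour on the unit square; 2-opt should uncross it.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
refined_tour, refined_len = two_opt(square, [0, 2, 1, 3])
gap = optimality_gap(refined_len, 4.0)
```

Each pass recomputes full tour lengths, so this version costs O(n^3) per sweep; practical implementations evaluate only the change in the two swapped edges. Starting from a learned tour instead of a random one typically leaves 2-opt with fewer crossings to fix.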


This review was generated by AI and reviewed by human editors.