[Paper Review] Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing

Arthur Delarue, Ross Anderson | arXiv (Cornell University) | Oct 22, 2020
Reinforcement Learning in Robotics | References: 37 | Cited by: 46
One-Sentence Summary

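This paper proposes a value-function-based reinforcement learning framework for combinatorial action spaces in which action selection is formulated as a mixed-integer program, and applies it to the CVRP using policy iteration with a neural value-function approximator. It achieves results competitive with baseline methods and comes close to OR-Tools on standard CVRP instances.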

ABSTRACT

Value-function-based methods have long played an important role in reinforcement learning. However, finding the best next action given a value function of arbitrary complexity is nontrivial when the action space is too large for enumeration. We develop a framework for value-function-based deep reinforcement learning with a combinatorial action space, in which the action selection problem is explicitly formulated as a mixed-integer optimization problem. As a motivating example, we present an application of this framework to the capacitated vehicle routing problem (CVRP), a combinatorial optimization problem in which a set of locations must be covered by a single vehicle with limited capacity. On each instance, we model an action as the construction of a single route, and consider a deterministic policy which is improved through a simple policy iteration algorithm. Our approach is competitive with other reinforcement learning methods and achieves an average gap of 1.7% with state-of-the-art OR methods on standard library instances of medium size.

Motivation and Objectives

  • Motivate reinforcement learning for combinatorial optimization and address large action spaces by embedding optimization in action selection.
  • Propose a policy-iteration RL framework where a neural network estimates the value function and a mixed-integer program selects the next action.
  • Apply the approach to the Capacitated Vehicle Routing Problem (CVRP) by reducing action selection to a prize-collecting traveling salesman problem (PC-TSP) with a knapsack constraint.
  • Demonstrate competitiveness with baselines and OR-Tools on random and library CVRP instances, highlighting single-instance learning advantages.
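
The reduction in the third bullet can be written out as a small optimization model. The following is a sketch only: the symbols $s_i$, $y_i$, $x_{ij}$, $c_{ij}$, $d_i$, and $Q$ are notation assumed here for illustration and are not taken from the paper. Let $s_i = 1$ if city $i$ is still unvisited, $y_i = 1$ if the next route serves city $i$, $x_{ij} = 1$ if the route travels from $i$ to $j$, $c_{ij}$ the travel cost, $d_i$ the demand of city $i$, and $Q$ the vehicle capacity.

```latex
\begin{aligned}
\min_{x,\,y,\,t} \quad & \sum_{i,j} c_{ij}\, x_{ij} \;+\; \hat{V}(t) \\
\text{s.t.} \quad
  & t_i = s_i - y_i, \quad y_i \le s_i
      && \text{for every city } i, \\
  & \sum_{i} d_i\, y_i \le Q
      && \text{(knapsack constraint)}, \\
  & x \text{ is a single tour through the depot and exactly the cities with } y_i = 1, \\
  & x_{ij} \in \{0,1\}, \quad y_i \in \{0,1\}.
\end{aligned}
```

Read this way, the problem is prize-collecting because serving a city is optional: the "prize" for serving it is the resulting reduction in the estimated cost-to-go $\hat{V}(t)$ of the remaining state $t$.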

Proposed Method

  • Represent CVRP states as binary vectors of unvisited cities and actions as feasible routes starting/ending at the depot.
  • Use a small neural network with ReLU activations to approximate the value function V^π for the current policy.
  • During policy improvement, select next action by minimizing C(a) + V̂(T(s,a)); solve this action-selection step as a mixed-integer program (MIP) that encodes PC-TSP with a knapsack constraint.
  • In the MIP, include the value V̂(t) of the next state t = T(s,a) as a piecewise-linear term obtained from the ReLU activations, so that standard MIP solvers can optimize over combinatorial actions.
  • Augment the objective with combinatorial lower bounds LB^p(t) to tighten the MIP and improve convergence.
  • Train the value network on data from policy evaluation, retaining data across iterations while giving older data a decaying influence.
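
To make the action-selection step above concrete, here is a minimal Python sketch on a toy instance. It is not the paper's implementation: the MIP is replaced by brute-force enumeration (only viable for a handful of cities), the ReLU network uses random untrained weights, and the value of the terminal state is assumed to be zero. All function names (`route_cost`, `make_value_net`, `select_action`, `rollout`) are hypothetical.

```python
import itertools
import math
import random


def route_cost(route, coords):
    """Length of depot -> cities in `route` order -> depot (depot is index 0)."""
    stops = [0, *route, 0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(stops, stops[1:]))


def make_value_net(n_cities, hidden=4, seed=0):
    """Hypothetical one-hidden-layer ReLU network with random, untrained weights."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_cities)] for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [rng.uniform(0, 1) for _ in range(hidden)]

    def value(state):
        if not any(state):
            return 0.0  # assumption: the terminal state has zero cost-to-go
        h = [max(0.0, sum(w * s for w, s in zip(row, state)) + b)
             for row, b in zip(w1, b1)]
        return sum(w * a for w, a in zip(w2, h))

    return value


def select_action(state, coords, demand, capacity, value):
    """Brute-force stand-in for the MIP: argmin over routes of C(a) + V_hat(T(s, a))."""
    unvisited = [i + 1 for i, bit in enumerate(state) if bit]
    best = None
    for k in range(1, len(unvisited) + 1):
        for subset in itertools.combinations(unvisited, k):
            if sum(demand[c] for c in subset) > capacity:
                continue  # knapsack (capacity) constraint
            # Transition T(s, a): cities served by the route become visited.
            next_state = tuple(0 if (i + 1) in subset else bit
                               for i, bit in enumerate(state))
            future = value(next_state)
            for route in itertools.permutations(subset):
                score = route_cost(route, coords) + future
                if best is None or score < best[0]:
                    best = (score, route, next_state)
    if best is None:
        raise ValueError("no feasible route: some demand exceeds the capacity")
    return best


def rollout(coords, demand, capacity, value):
    """Apply the greedy-in-V_hat policy until every city has been served."""
    state = tuple([1] * (len(coords) - 1))  # 1 = unvisited
    routes, total = [], 0.0
    while any(state):
        _, route, state = select_action(state, coords, demand, capacity, value)
        routes.append(route)
        total += route_cost(route, coords)
    return routes, total


if __name__ == "__main__":
    coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
    demand = [0, 1, 1, 1, 1]  # index 0 is the depot
    routes, total = rollout(coords, demand, capacity=2, value=make_value_net(4))
    print(routes, round(total, 3))
```

Enumeration grows factorially with the number of unvisited cities, which is why the summarized method hands this step to a MIP solver instead.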

Experimental Results

Research Questions

  • RQ1: Can combinatorial action spaces in RL be effectively handled by embedding optimization (MIP) within the action selection step?
  • RQ2: How does a small neural network value function, paired with an optimization-based action selector, perform on CVRP compared to RL baselines and OR-Tools?
  • RQ3: What is the impact of data retention, network size, and regularization on policy iteration performance for CVRP?
  • RQ4: Is a single-instance RL approach competitive with distribution-based RL methods for CVRP on standard benchmark instances?

Key Findings

  • The average gap to OR-Tools is 1.7% on moderately sized standard CVRP library instances.
  • RLCA (the paper's reinforcement learning with combinatorial actions method), using 16 neurons, is competitive with prior RL methods despite its simple neural architecture.
  • On random CVRP instances with 11, 21, and 51 cities, RLCA outperforms the greedy baseline and matches or approaches OR-Tools and, in some settings, the optimal CP-SAT solutions, within practical running time.
  • The training-time bottleneck is solving the action-selection MIP; Gurobi is generally faster than SCIP, which allows faster policy iterations.
  • Incorporating combinatorial lower bounds modestly improves convergence and solution quality; larger networks improve performance, with diminishing returns.

This review was generated by AI and checked by a human editor.