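[Paper Summary] Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing
This paper proposes a value-function-based reinforcement learning framework with combinatorial actions, in which action selection is formulated as a mixed-integer program. It applies the framework to the CVRP using policy iteration with a neural value-function approximator. The method is competitive with baseline methods and comes close to OR-Tools on standard CVRP instances.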
Value-function-based methods have long played an important role in reinforcement learning. However, finding the best next action given a value function of arbitrary complexity is nontrivial when the action space is too large for enumeration. We develop a framework for value-function-based deep reinforcement learning with a combinatorial action space, in which the action selection problem is explicitly formulated as a mixed-integer optimization problem. As a motivating example, we present an application of this framework to the capacitated vehicle routing problem (CVRP), a combinatorial optimization problem in which a set of locations must be covered by a single vehicle with limited capacity. On each instance, we model an action as the construction of a single route, and consider a deterministic policy which is improved through a simple policy iteration algorithm. Our approach is competitive with other reinforcement learning methods and achieves an average gap of 1.7% with state-of-the-art OR methods on standard library instances of medium size.
Research Motivation and Objectives
- Motivate reinforcement learning for combinatorial optimization and address large action spaces by embedding optimization in action selection.
- Propose a policy-iteration RL framework where a neural network estimates the value function and a mixed-integer program selects the next action.
- Apply the approach to the capacitated vehicle routing problem (CVRP) by reducing action selection to a prize-collecting traveling salesman problem (PC-TSP) with a knapsack constraint.
- Demonstrate competitiveness with baselines and OR-Tools on random and library CVRP instances, highlighting single-instance learning advantages.
Proposed Method
- Represent CVRP states as binary vectors of unvisited cities and actions as feasible routes starting/ending at the depot.
- Use a small neural network with ReLU activations to approximate the value function V^π for the current policy.
- During policy improvement, select the next action by minimizing C(a) + V̂(T(s,a)), the cost of route a plus the estimated value of the resulting state; solve this action-selection step as a mixed-integer program (MIP) that encodes a PC-TSP with a knapsack constraint.
- In the MIP, include the value V̂(t) of the successor state t = T(s,a) as a piecewise-linear term obtained from the ReLU activations, enabling standard MIP solvers to optimize over combinatorial actions.
- Augment the objective with combinatorial lower bounds LB^p(t) to tighten the MIP and improve convergence.
- Train the value network with data from policy evaluation, using retention of data across iterations and a decaying influence of older data.
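The action-selection step above can be illustrated on a toy instance. The following is a minimal sketch, not the paper's implementation: it replaces the MIP with brute-force enumeration of feasible routes, which is only viable for a handful of cities, and the function names (`relu_value`, `route_cost`, `select_action`, `rollout`), the depot-at-index-0 convention, and the network weights are all assumptions made for illustration.

```python
import itertools
import math


def relu_value(state, W1, b1, w2, b2):
    """One-hidden-layer ReLU network: V_hat(s) = w2 . relu(W1 s + b1) + b2."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, state)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def route_cost(route, coords):
    """Euclidean length of depot -> route -> depot (depot is index 0)."""
    path = [0, *route, 0]
    return sum(math.dist(coords[i], coords[j]) for i, j in zip(path, path[1:]))


def select_action(state, coords, demand, capacity, value_fn):
    """Return (route, score) minimizing C(a) + V_hat(T(s, a)).

    Enumeration stands in for the MIP here (an assumption of this sketch).
    state[i] == 1 means city i is still unvisited; state[0] (depot) is 0.
    """
    unvisited = [i for i, flag in enumerate(state) if flag and i != 0]
    best_route, best_score = None, math.inf
    for k in range(1, len(unvisited) + 1):
        for subset in itertools.combinations(unvisited, k):
            # Knapsack-style capacity constraint on the route.
            if sum(demand[i] for i in subset) > capacity:
                continue
            # Transition T(s, a): cities on the route become visited.
            next_state = [0 if i in subset else f for i, f in enumerate(state)]
            v_next = value_fn(next_state)
            for route in itertools.permutations(subset):
                score = route_cost(route, coords) + v_next
                if score < best_score:
                    best_route, best_score = route, score
    if best_route is None:
        raise ValueError("no feasible route: some demand exceeds capacity")
    return best_route, best_score


def rollout(state, coords, demand, capacity, value_fn):
    """Follow the greedy policy until every city is visited."""
    state = list(state)
    routes, total = [], 0.0
    while any(state):
        route, _ = select_action(state, coords, demand, capacity, value_fn)
        routes.append(route)
        total += route_cost(route, coords)
        for i in route:
            state[i] = 0
    return routes, total


if __name__ == "__main__":
    coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
    demand = [0, 1, 1, 1]
    # Hypothetical weights: V_hat(s) = 10 * (number of unvisited cities).
    v_hat = lambda s: relu_value(s, [[0.0, 1.0, 1.0, 1.0]], [0.0], [10.0], 0.0)
    print(rollout([0, 1, 1, 1], coords, demand, 2, v_hat))
```

With a value function that is identically zero, the selector simply picks the cheapest single route, the shortest out-and-back trip. A value function that penalizes leaving cities unvisited pushes it toward longer routes. This trade-off between immediate route cost and estimated cost-to-go is what the MIP optimizes at scale. The costs accumulated along a rollout are the kind of data that policy evaluation would use to fit the value network.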
Experimental Results
Research Questions
- RQ1: Can combinatorial action spaces in RL be effectively handled by embedding optimization (MIP) within the action selection step?
- RQ2: How does a small neural network value function, paired with an optimization-based action selector, perform on CVRP compared to RL baselines and OR-Tools?
- RQ3: What is the impact of data retention, network size, and regularization on policy iteration performance for CVRP?
- RQ4: Is a single-instance RL approach competitive with distribution-based RL methods for CVRP on standard benchmark instances?
Key Findings
- Average gap to OR-Tools on standard CVRP library instances is 1.7% across moderately sized problems.
- The RLCA method, even with a network of only 16 neurons, is competitive with prior RL methods despite its simple neural architecture.
- For 11, 21, 51-city Random CVRP instances, RLCA outperforms greedy and matches or approaches performance of OR-Tools and, in some settings, optimal CP-SAT solutions within practical time.
- The training-time bottleneck is solving the action-selection MIP; Gurobi is generally faster than SCIP, which shortens each policy iteration.
- Incorporating combinatorial lower bounds modestly improves convergence and solution quality; larger networks improve performance up to diminishing returns.
This summary was generated by AI and reviewed by human editors.