QUICK REVIEW

[Paper Review] Learning Permutations with Sinkhorn Policy Gradient

Patrick Emami, Sanjay Ranka|arXiv (Cornell University)|May 18, 2018

Machine Learning and Algorithms | References: 37 | Cited by: 40

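One-Sentence Summary

This paper proposes Sinkhorn Policy Gradient (SPG), a policy gradient method for learning policies over permutation matrices that uses a temperature-controlled Sinkhorn layer as a differentiable relaxation, enabling end-to-end training in an actor-critic framework.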

ABSTRACT

Many problems at the intersection of combinatorics and computer science require solving for a permutation that optimally matches, ranks, or sorts some data. These problems usually have a task-specific, often non-differentiable objective function that data-driven algorithms can use as a learning signal. In this paper, we propose the Sinkhorn Policy Gradient (SPG) algorithm for learning policies on permutation matrices. The actor-critic neural network architecture we introduce for SPG uniquely decouples representation learning of the state space from the highly-structured action space of permutations with a temperature-controlled Sinkhorn layer. The Sinkhorn layer produces continuous relaxations of permutation matrices so that the actor-critic architecture can be trained end-to-end. Our empirical results show that agents trained with SPG can perform competitively on sorting, the Euclidean TSP, and matching tasks. We also observe that SPG is significantly more data efficient at the matching task than the baseline methods, which indicates that SPG is conducive to learning representations that are useful for reasoning about permutations.

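Motivation and Goals

Motivate learning algorithms for combinatorial problems whose solutions are permutations.
Develop a differentiable, end-to-end trainable policy over permutation matrices.
Decouple state representation learning from the structured permutation action space by means of a Sinkhorn layer.
Demonstrate data efficiency and competitive performance on sorting, maximum weight matching (MWM), and the Euclidean TSP.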

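Proposed Method

Introduce SPG, an off-policy deterministic policy gradient method over the space of permutations P_N.
Relax permutations to continuous doubly stochastic matrices with a temperature-controlled Sinkhorn layer, which makes the policy gradient differentiable.
Adopt an actor-critic architecture in which the actor outputs a doubly stochastic matrix M; the nearest permutation P is obtained by rounding with the Hungarian algorithm, and gradients bypass P because the rounding step is not differentiable.
Add a critic penalty term that aligns the Q-values of the discrete and continuous actions to reduce the bias introduced by the relaxation.
Train with a replay buffer and an exploration strategy that combines GRASP-inspired k-exchange perturbations with epsilon-greedy exploration.
Provide ablation studies and experiments on sorting, maximum weight matching (MWM), and the Euclidean TSP to demonstrate data efficiency and performance gains.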
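
The relaxation and rounding steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sinkhorn` and `round_to_permutation`, the iteration count, and the example scores are all hypothetical. To stay dependency-free, the rounding step uses a brute-force search over permutations in place of the Hungarian algorithm; both solve the same assignment problem, but the brute-force version is only usable for very small N. The gradient bypass around P is not shown, since plain Python has no automatic differentiation.

```python
import itertools
import math


def sinkhorn(scores, tau=1.0, n_iters=100):
    """Map a square score matrix to an approximately doubly stochastic matrix.

    Scores are divided by the temperature tau and exponentiated, then rows
    and columns are normalized in alternation (Sinkhorn normalization).
    """
    n = len(scores)
    # Subtracting the global maximum rescales the whole matrix by a constant,
    # which leaves the normalized result unchanged but avoids overflow.
    top = max(max(row) for row in scores)
    m = [[math.exp((x - top) / tau) for x in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: every row sums to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column normalization: every column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def round_to_permutation(m):
    """Return the permutation p (row i -> column p[i]) maximizing sum_i m[i][p[i]].

    Brute force over all N! permutations: a stand-in for the Hungarian
    algorithm, which solves the same problem in polynomial time.
    """
    n = len(m)
    return max(
        itertools.permutations(range(n)),
        key=lambda p: sum(m[i][p[i]] for i in range(n)),
    )


if __name__ == "__main__":
    # Hypothetical actor output (unnormalized scores) for N = 3.
    scores = [[2.0, 0.5, 0.1],
              [0.3, 0.2, 1.5],
              [0.4, 1.8, 0.6]]
    soft = sinkhorn(scores, tau=0.5)     # continuous relaxation M
    hard = round_to_permutation(soft)    # discrete action P
    for row in soft:
        print([round(x, 3) for x in row])
    print(hard)
```

Maximizing the sum of selected entries is equivalent to finding the permutation matrix closest to M in Frobenius norm, which is why the assignment problem serves as the rounding step. Very small values of tau can make the exponentials underflow in this naive version; a log-domain formulation would be more robust.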

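Experimental Results

Research Questions

RQ1: Can SPG effectively learn policies over permutation matrices on the sorting, MWM, and TSP tasks?
RQ2: Does the temperature-controlled Sinkhorn relaxation make end-to-end differentiable training of permutation policies possible?
RQ3: Does the critic penalty term reduce the bias from the continuous relaxation and improve learning stability?
RQ4: How does the data efficiency of SPG compare with that of baseline models on permutation-based tasks?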

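Key Findings

SPG learns competitive solutions on sorting, MWM, and the Euclidean TSP.
As problem size grows, SPG is more data efficient than the baseline methods on the matching task.
The critic penalty term helps align the soft (continuous) and hard (discrete) Q-values, reduces the bias introduced by the relaxation, and allows learning to continue longer before it saturates.
A smaller Sinkhorn temperature tau yields a higher average reward at the cost of slightly higher variance, with diminishing returns below tau = 0.05.
The exploration strategy that combines GRASP-style perturbations with epsilon-greedy exploration is robust across tasks.
SPG+Matching learns effective representations for permutation-based tasks and scales better with problem size than the baseline RL decoder method.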
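
One possible reading of the exploration strategy mentioned above is sketched below: with probability epsilon, the greedy permutation is perturbed by a k-exchange move, and otherwise it is used as-is. The helper names `k_exchange` and `explore`, the cyclic-shift form of the move, and the parameter values are assumptions made for illustration; the summary does not specify the exact neighborhood or schedule used in the paper.

```python
import random


def k_exchange(perm, k, rng):
    """Return a copy of perm in which k randomly chosen positions are shuffled.

    The values at the chosen positions are rotated by one, so each of the k
    positions receives a different value and the result is still a permutation.
    Requires 2 <= k <= len(perm).
    """
    perm = list(perm)
    idx = rng.sample(range(len(perm)), k)
    vals = [perm[i] for i in idx]
    vals = vals[1:] + vals[:1]  # cyclic shift of the selected values
    for i, v in zip(idx, vals):
        perm[i] = v
    return perm


def explore(perm, epsilon, k, rng):
    """Epsilon-greedy wrapper: perturb with probability epsilon, else keep perm."""
    if rng.random() < epsilon:
        return k_exchange(perm, k, rng)
    return list(perm)


if __name__ == "__main__":
    rng = random.Random(0)
    greedy = [0, 2, 1, 4, 3]  # hypothetical rounded action from the actor
    for _ in range(5):
        print(explore(greedy, epsilon=0.5, k=2, rng=rng))
```

With k = 2 the move is a plain swap of two positions; larger k gives wider jumps in permutation space, and epsilon controls how often the agent departs from the greedy action.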

This review was generated by AI and checked by a human editor.