QUICK REVIEW

[Paper Review] Differentiable Dynamic Programming for Structured Prediction and Attention

Arthur Mensch, Mathieu Blondel | arXiv (Cornell University) | Feb 11, 2018

Reinforcement Learning in Robotics | References: 24 | Citations: 55

One-Line Summary

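This paper introduces a framework that makes dynamic programming differentiable by smoothing the max operator with a strongly convex regularizer, enabling differentiable DP layers and end-to-end learning. It includes two instantiations: a smoothed Viterbi algorithm and a smoothed DTW algorithm.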

ABSTRACT

Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.

Motivation and Objectives

Provide a unified method to transform a broad class of dynamic programs into differentiable operators.
Show that smoothed DP operators are convex relaxations of the original DP, and derive interpretable gradients as expected paths.
Derive two instantiations: a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment.
Demonstrate differentiable DP layers in neural networks for structured prediction and structured attention.

Proposed Method

Define max_Omega as a smoothed maximum: the max is written as a linear maximization over the probability simplex, and a strongly convex regularizer Omega is subtracted from the objective.
Form a smoothed DP recursion DP_Omega by replacing max with max_Omega in the Bellman-like recurrence, yielding a differentiable, convex operator.
Show that DP_Omega is a relaxation of LP, the linear-program formulation of the highest-scoring path problem on the DP graph, and analyze bounds relating LP and DP_Omega, with special cases where Omega is the negentropy or the squared L2 norm.
Provide a backpropagation scheme to compute gradients ∇DP_Omega and Hessian-vector products ∇^2 DP_Omega Z efficiently in O(|E|) time on the DP graph.
Interpret ∇DP_Omega as an expected path under a certain random walk on the DP graph, offering a probabilistic perspective and linking to CRF-like distributions when using negentropy.
Detail how to backprop through both DP_Omega and ∇DP_Omega for differentiable layers in neural nets.
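
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`smoothed_max_entropy`, `smoothed_max_l2`, `project_simplex`, `smoothed_dp`), the edge-dictionary encoding of the DP graph, and the convention of scaling Omega by `gamma` are all choices made for this example. The forward pass applies the smoothed max at each node in topological order; the backward pass propagates the local argmax distributions from the last node back to the first, which gives the gradient of the DP value with respect to the edge scores in one sweep over the edges.

```python
import math

def smoothed_max_entropy(x, gamma=1.0):
    """Negentropy-smoothed max: gamma * logsumexp(x / gamma); gradient is a softmax."""
    m = max(x)
    exps = [math.exp((xi - m) / gamma) for xi in x]
    z = sum(exps)
    return m + gamma * math.log(z), [e / z for e in exps]

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:          # holds for a prefix of the sorted values
            tau = t
    return [max(vi - tau, 0.0) for vi in v]

def smoothed_max_l2(x, gamma=1.0):
    """Squared-L2-smoothed max with Omega(q) = gamma/2 * ||q||^2; gradient can be sparse."""
    q = project_simplex([xi / gamma for xi in x])
    value = sum(qi * xi for qi, xi in zip(q, x)) - 0.5 * gamma * sum(qi * qi for qi in q)
    return value, q

def smoothed_dp(n_nodes, edges, smoothed_max, gamma=1.0):
    """Smoothed DP on a DAG whose nodes 0..n_nodes-1 are in topological order.

    edges maps (parent, child) -> score, with parent < child. Every node except
    node 0 is assumed to have at least one parent. Returns the smoothed value at
    the last node and its gradient with respect to each edge score.
    """
    parents = [[] for _ in range(n_nodes)]
    for (j, i), theta in edges.items():
        parents[i].append((j, theta))

    # Forward pass: v[i] = max_Omega over parents j of (theta[j, i] + v[j]).
    v = [0.0] * n_nodes
    q = {}
    for i in range(1, n_nodes):
        scores = [theta + v[j] for j, theta in parents[i]]
        v[i], probs = smoothed_max(scores, gamma)
        for (j, _), p in zip(parents[i], probs):
            q[(j, i)] = p

    # Backward pass: e[i] is the total weight of paths through node i.
    e = [0.0] * n_nodes
    e[-1] = 1.0
    grad = {}
    for i in range(n_nodes - 1, 0, -1):
        for j, _ in parents[i]:
            grad[(j, i)] = q[(j, i)] * e[i]
            e[j] += grad[(j, i)]
    return v[-1], grad

# Hypothetical 4-node DAG with two paths from node 0 to node 3.
edges = {(0, 1): 1.0, (0, 2): 2.0, (1, 3): 3.0, (2, 3): 1.0}
value_ent, grad_ent = smoothed_dp(4, edges, smoothed_max_entropy, gamma=1.0)
value_l2, grad_l2 = smoothed_dp(4, edges, smoothed_max_l2, gamma=1.0)
```

On this toy graph, `grad_ent` puts nonzero weight on the edges of both paths, while `grad_l2` puts weight exactly zero on the lower-scoring path.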

Experimental Results

Research Questions

RQ1: How can a broad class of dynamic programming algorithms be made differentiable while preserving structure?
RQ2: What are the theoretical and practical implications of smoothing the max operator in DP, and how does this relate to graphical model inference?
RQ3: How can we instantiate the framework for concrete problems like sequence prediction (Viterbi) and time-series alignment (DTW)?
RQ4: Can we backpropagate through both the DP value and its gradient to enable end-to-end learning of all components?
RQ5: What is the role of regularizers (e.g., negentropy vs squared L2) in shaping the solution and sparsity of gradients?

Key Findings

DP_Omega provides a smooth, convex relaxation of the original dynamic program, enabling differentiable layers.
The gradient ∇DP_Omega equals an expected path under a distribution defined by a local random walk on the DP graph.
As the regularization strength gamma goes to zero, ∇DP_{gamma Omega} converges to a subgradient of the original LP, and thus recovers the hard DP solution when that solution is unique.
Negentropy regularization recovers CRF-like behavior, while squared L2 regularization yields sparser gradient distributions.
The framework yields two concrete instantiations: Vit_Omega (smoothed Viterbi) for sequence labeling and DTW_Omega for time-series alignment, with backpropagation through both value and gradient.
The proposed differentiable DP layers support end-to-end learning in structured prediction tasks and in structured attention mechanisms for neural machine translation.
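
To make the DTW_Omega instantiation concrete, here is a minimal sketch that assumes the negentropy regularizer and a precomputed cost matrix `theta`. The names `soft_min` and `soft_dtw`, the padded value table, and the toy series are hypothetical choices for this example, not the paper's code. Because DTW minimizes, the sketch uses the smoothed min, min_Omega(x) = -max_Omega(-x). The returned gradient is an expected alignment matrix, and for small `gamma` it approaches the hard alignment when that alignment is unique.

```python
import math

def soft_min(values, gamma):
    """Negentropy-smoothed min and its gradient (a softmin distribution)."""
    m = min(values)
    exps = [math.exp(-(x - m) / gamma) for x in values]  # exp(-inf) == 0.0
    z = sum(exps)
    return m - gamma * math.log(z), [e / z for e in exps]

def soft_dtw(theta, gamma=1.0):
    """Smoothed DTW value and its gradient with respect to the cost matrix theta."""
    n, m = len(theta), len(theta[0])
    inf = float("inf")
    # v is padded with a border of +inf so boundary cells need no special case.
    v = [[inf] * (m + 1) for _ in range(n + 1)]
    v[0][0] = 0.0
    q = {}  # q[(pred, cell)]: local probability of reaching cell from pred
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            preds = [(i - 1, j), (i - 1, j - 1), (i, j - 1)]
            val, w = soft_min([v[a][b] for a, b in preds], gamma)
            v[i][j] = theta[i - 1][j - 1] + val
            for p, wp in zip(preds, w):
                q[(p, (i, j))] = wp

    # Backward pass: e[i][j] is the probability that the alignment visits (i, j).
    e = [[0.0] * (m + 2) for _ in range(n + 2)]
    e[n][m] = 1.0
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            if (i, j) == (n, m):
                continue
            for s in [(i + 1, j), (i + 1, j + 1), (i, j + 1)]:
                if s[0] <= n and s[1] <= m:
                    e[i][j] += q[((i, j), s)] * e[s[0]][s[1]]
    grad = [[e[i][j] for j in range(1, m + 1)] for i in range(1, n + 1)]
    return v[n][m], grad

# Toy series (assumed for illustration); costs are squared differences.
x, y = [0.0, 0.0, 2.0], [0.0, 2.0]
theta = [[(xi - yj) ** 2 for yj in y] for xi in x]
value_soft, align_soft = soft_dtw(theta, gamma=1.0)
value_hard, align_hard = soft_dtw(theta, gamma=0.01)
```

With `gamma=1.0` the alignment mass in `align_soft` is spread over several cells. With `gamma=0.01`, `align_hard` is numerically the 0/1 indicator of the single best alignment for these toy series.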

This review was generated by AI and checked by a human editor.