QUICK REVIEW

[Paper Review] An Actor-Critic Algorithm for Sequence Prediction

Dzmitry Bahdanau, Philémon Brakel|arXiv (Cornell University)|Jul 24, 2016

Multimodal Machine Learning Applications | References: 40 | Citations: 224

One-sentence summary

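This paper introduces an actor-critic framework for training sequence generation models: a critic predicts the value of each candidate token so that the actor can be optimized toward test-time metrics such as BLEU, and the method outperforms MLE and REINFORCE training on spelling correction and machine translation tasks.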

ABSTRACT

We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a critic network that is trained to predict the value of an output token, given the policy of an actor network. This results in a training procedure that is much closer to the test phase, and allows us to directly optimize for a task-specific score such as BLEU. Crucially, since we leverage these techniques in the supervised learning setting rather than the traditional RL setting, we condition the critic network on the ground-truth output. We show that our method leads to improved performance on both a synthetic task, and for German-English machine translation. Our analysis paves the way for such methods to be applied in natural language generation tasks, such as machine translation, caption generation, and dialogue modelling.

Motivation and objectives

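Enable training that directly maximizes a task-specific score.
Address the mismatch between training and testing by conditioning on model-generated prefixes during training.
Introduce a critic network that predicts the value of each token under the current policy.
Demonstrate improvements over standard MLE and REINFORCE training on spelling correction and machine translation tasks.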
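
The train-test mismatch named above can be made concrete with a toy decoder. The sketch below is only an illustration and is not taken from the paper: `toy_step` is a hypothetical stand-in for a trained model, and `decode` and `num_errors` are helper names introduced here. It compares decoding conditioned on reference tokens (teacher forcing, as in log-likelihood training) with decoding conditioned on the model's own outputs (as at test time).

```python
def decode(step_fn, length, ground_truth=None):
    """Generate `length` tokens.

    If `ground_truth` is given, each next step is conditioned on the
    reference token (teacher forcing); otherwise it is conditioned on the
    model's own previous output (free-running, as at test time).
    """
    prefix, outputs = [], []
    for t in range(length):
        token = step_fn(prefix)
        outputs.append(token)
        prefix.append(ground_truth[t] if ground_truth is not None else token)
    return outputs


def toy_step(prefix):
    """Hypothetical model: counts upward, but starts one token too high."""
    return 2 if not prefix else prefix[-1] + 1


def num_errors(outputs, reference):
    """Count positions where the output differs from the reference."""
    return sum(o != r for o, r in zip(outputs, reference))


reference = [1, 2, 3, 4]
teacher_forced = decode(toy_step, 4, ground_truth=reference)
free_running = decode(toy_step, 4)
print(num_errors(teacher_forced, reference), num_errors(free_running, reference))
```

In this toy setting the teacher-forced run makes one error, while the free-running run makes four, because the first mistake propagates through every later prefix. Training on model-generated prefixes exposes the model to exactly this situation.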

Proposed method

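Formulate sequence generation as a stochastic policy, with an actor (the decoder) and a critic.
Define value functions V and Q over partial sequences (prefixes) and candidate actions (tokens).
Train the critic on temporal-difference targets, stabilized with a target network and a delayed actor.
Update the actor with a policy-gradient estimate that weights every candidate token by the critic's Q estimate (the estimator is unbiased when the true Q is used), optionally adding a log-likelihood gradient term.
Apply reward shaping to provide intermediate feedback and mitigate sparse rewards.
Pre-train the actor and the critic before joint actor-critic training, to bootstrap learning.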
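
The steps above can be sketched numerically. The NumPy code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, array layouts, and the choice to use score differences between successive prefixes as shaped rewards are assumptions made here for clarity. It shows (1) shaped per-step rewards, (2) temporal-difference targets that bootstrap from a delayed actor and a target critic, and (3) the gradient of the actor objective sum_a p(a) Q(a) with respect to softmax logits, with Q held fixed.

```python
import numpy as np


def shaped_rewards(prefix_scores):
    """r_t = score(prefix of length t) - score(prefix of length t-1).

    prefix_scores[t-1] is the task score of the first t generated tokens;
    the empty prefix is assumed to score 0.
    """
    scores = np.asarray(prefix_scores, dtype=float)
    return np.diff(scores, prepend=0.0)


def td_targets(rewards, delayed_actor_probs, target_critic_q):
    """q_t = r_t + sum_a p'(a | prefix_t) * Q'(a; prefix_t).

    rewards:             shape (T,)
    delayed_actor_probs: shape (T, V); row t is the delayed actor's
                         distribution over the token that follows step t
    target_critic_q:     shape (T, V); target critic's values for those tokens
    The last step has no successor, so its bootstrap term is set to zero.
    """
    rewards = np.asarray(rewards, dtype=float)
    probs = np.asarray(delayed_actor_probs, dtype=float)
    q_next = np.asarray(target_critic_q, dtype=float)
    bootstrap = (probs * q_next).sum(axis=1)
    bootstrap[-1] = 0.0
    return rewards + bootstrap


def actor_logit_gradient(probs, q_values):
    """Gradient of sum_a p(a) * Q(a) w.r.t. softmax logits, Q held fixed."""
    probs = np.asarray(probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    expected_q = (probs * q_values).sum(axis=-1, keepdims=True)
    return probs * (q_values - expected_q)


rewards = shaped_rewards([0.0, 0.5, 0.5, 1.0])
print(rewards, rewards.sum())
```

Because the shaped rewards are differences of prefix scores, they sum to the score of the full sequence, so shaping changes when feedback arrives without changing the total return. The logit gradient raises the probability of tokens whose Q estimate exceeds the policy's expected Q and lowers the others.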

Experimental results

Research questions

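RQ1: Can actor-critic training improve task-specific sequence scores such as BLEU compared with MLE and REINFORCE?
RQ2: Does giving the critic access to the ground-truth output during training help training, even though the critic is not used at test time?
RQ3: Which training techniques (target networks, reward shaping, value penalties, and so on) are essential for stability and performance?
RQ4: How does the method perform relative to baselines on synthetic spelling correction data and on real MT datasets (IWSLT, WMT)?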

Key findings

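Actor-critic training improves over log-likelihood training on spelling correction across multiple settings.
On the IWSLT 2014 and WMT14 MT tasks, the actor-critic method achieves BLEU gains over the baselines; the gains are most pronounced with greedy decoding, and the method remains competitive under beam search.
Target networks and a variance penalty on the critic's outputs are essential for stable learning and better performance.
Reward shaping and the delayed actor contribute further performance gains.
Compared with earlier RL-based methods such as MIXER, the method achieves competitive or better results while being measured against stronger or comparable baselines.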
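
The two stabilizers singled out above can be written down compactly. The NumPy sketch below is an assumption-laden illustration rather than the paper's code: `penalty_weight` and `rate` are hypothetical hyperparameter names, and the critic loss is shown as a plain sum of squared TD errors plus the penalty. The variance penalty sums the squared deviations of the critic's outputs from their mean over the vocabulary at each step, and the soft update moves target-network parameters a small step toward the online parameters.

```python
import numpy as np


def critic_variance_penalty(q_values):
    """C_t = sum_a (Q(a) - mean_b Q(b))^2 for each time step t.

    q_values: shape (T, V), the critic's outputs for all V tokens per step.
    """
    q = np.asarray(q_values, dtype=float)
    centered = q - q.mean(axis=1, keepdims=True)
    return (centered ** 2).sum(axis=1)


def critic_loss(q_taken, targets, q_values, penalty_weight):
    """Squared TD error plus a weighted variance penalty, summed over steps.

    q_taken: shape (T,), critic values of the tokens actually generated
    targets: shape (T,), TD targets for those tokens
    penalty_weight: hypothetical hyperparameter scaling the penalty
    """
    q_taken = np.asarray(q_taken, dtype=float)
    targets = np.asarray(targets, dtype=float)
    td_error = (q_taken - targets) ** 2
    penalty = critic_variance_penalty(q_values)
    return float((td_error + penalty_weight * penalty).sum())


def soft_update(target_params, online_params, rate):
    """Move target-network parameters a fraction `rate` toward the online ones."""
    target = np.asarray(target_params, dtype=float)
    online = np.asarray(online_params, dtype=float)
    return (1.0 - rate) * target + rate * online


q_values = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
print(critic_variance_penalty(q_values))
```

A step where the critic assigns the same value to every token incurs no penalty, while a step with widely spread values is penalized. A small `rate` keeps the target network changing slowly, so the TD targets do not chase the critic's own latest updates.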

This review was generated by AI and checked by a human editor.