QUICK REVIEW

[Paper Review] Non-Autoregressive Machine Translation with Disentangled Context Transformer

Jungo Kasai, James Cross|arXiv (Cornell University)|Jan 15, 2020

Natural Language Processing Techniques|References: 43|Citations: 51

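One-Line Summary

This paper introduces the Disentangled Context (DisCo) transformer for non-autoregressive translation, which combines an attention-masking objective with parallel easy-first inference to cut the number of decoding steps while keeping BLEU competitive.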

ABSTRACT

State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens. The sequential nature of this generation process causes fundamental latency in inference since we cannot generate multiple tokens in each sentence in parallel. We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts. The DisCo transformer is trained to predict every output token given an arbitrary subset of the other reference tokens. We also develop the parallel easy-first inference algorithm, which iteratively refines every token in parallel and reduces the number of required iterations. Our extensive experiments on 7 translation directions with varying data sizes demonstrate that our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average. Our code is available at https://github.com/facebookresearch/DisCo.

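Motivation and Objectives

Reduce decoding latency in neural machine translation by moving away from left-to-right autoregressive decoding.
Propose the Disentangled Context (DisCo) transformer, which predicts each target token conditioned on an arbitrary subset of the other target tokens.
Develop a parallel easy-first inference algorithm that refines all tokens in parallel and stops once the predictions converge across iterations.
Show that DisCo achieves competitive BLEU scores while substantially reducing decoding time across multiple language directions and data sizes.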

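Proposed Method

Introduces the DisCo transformer, which uses attention masking so that the prediction at each target position attends only to the tokens observed for that position.
Defines the DisCo objective: predict Y_n conditioned on X and an arbitrary subset Y_obs^n of the other target tokens, so that the conditional probabilities for all positions can be computed in a single pass.
Explains how DisCo layers are stacked while avoiding leakage, by decontextualizing the keys and values that would otherwise come from the previous layer.
Trains on random subsets of observed tokens and includes a length-prediction loss to enable parallel decoding.
Proposes parallel easy-first inference: every position is predicted at each iteration, and tokens are updated in order of increasing uncertainty, which allows a variable number of iterations.
Uses distillation from a strong autoregressive teacher and standard transformer hyperparameters; evaluates with BLEU on multiple WMT directions.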
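The random-subset training described above can be pictured as sampling, for every target position, which other reference tokens it is allowed to see. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the function names `sample_observed_mask` and `observed_tokens`, the toy sentence, and the choice to draw the subset size uniformly are all assumptions made for illustration.

```python
import random


def sample_observed_mask(length, rng):
    """Sample a DisCo-style training mask for one target sentence.

    mask[n][m] is True when the prediction at position n may attend to the
    reference token at position m. A position never observes itself.
    """
    mask = [[False] * length for _ in range(length)]
    for n in range(length):
        others = [m for m in range(length) if m != n]
        # Assumption: the subset size is drawn uniformly from 0..length-1.
        k = rng.randint(0, len(others))
        for m in rng.sample(others, k):
            mask[n][m] = True
    return mask


def observed_tokens(reference, mask, n):
    """Return the tokens Y_obs^n that position n is allowed to see."""
    return [tok for m, tok in enumerate(reference) if mask[n][m]]


rng = random.Random(0)
reference = ["wir", "sehen", "den", "Hund"]  # toy target sentence (hypothetical)
mask = sample_observed_mask(len(reference), rng)
for n, token in enumerate(reference):
    print(f"predict {token!r} given {observed_tokens(reference, mask, n)}")
```

Because each row of the mask is sampled independently, every position gets its own context, which is what lets all conditional probabilities be computed in one pass. In a real model the boolean mask would be applied to the decoder self-attention scores rather than used to filter token lists.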

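Experimental Results

Research Questions

RQ1: Can DisCo, a non-autoregressive transformer that uses disentangled contexts, achieve BLEU competitive with state-of-the-art NAT and autoregressive models?
RQ2: Does the DisCo objective enable efficient one-pass conditioning and effective parallel decoding?
RQ3: How does parallel easy-first inference compare with mask-predict in terms of BLEU and number of iterations, and how does this vary with data size?
RQ4: How do data size and distillation affect DisCo's performance relative to the baselines?
RQ5: How do NAT decoding strategies affect speed and quality on WMT tasks?

Key Findings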

Model	en→de BLEU	de→en BLEU	en→ro BLEU	ro→en BLEU	Steps (approx)
Gu et al. (2018) (CMLM)	–	–	–	–	1
Wang et al. (2019) (n=9)	–	–	–	–	1
Li et al. (2019) (n=9)	–	–	–	–	1
Ma et al. (2019) (n=30)	25.31	1	30.68	1	1
Sun et al. (2019) (n=19)	26.80	1	30.04	–	1
Ran et al. (2019)	26.51	1	31.13	1	1
Shu et al. (2020) (n=50)	25.1	–	–	–	1
Our Implementations (CMLM+Mask-Predict, 4 steps)	26.73	4	30.75	4	4
Our Implementations (CMLM+Mask-Predict, 10 steps)	27.39	10	31.24	10	10
DisCo + Mask-Predict (4 steps)	25.83	4	32.22	4	4
DisCo + Mask-Predict (10 steps)	27.06	10	32.92	10	10
DisCo + Easy-First (EN→DE)	27.34	4.23	33.22	3.29	4.82
DisCo + Easy-First (EN→RO)	–	–	33.25	–	3.10

DisCo with parallel easy-first inference achieves BLEU competitive with or better than CMLM-based Mask-Predict while using significantly fewer iterations (e.g., en→de 4.82 steps; ro→en 3.10 steps).
On EN-DE / EN-RO, DisCo+Easy-First reaches BLEU scores comparable to or better than strong NAT baselines, with large gains when data are plentiful (EN-ZH, EN-FR).
Distillation consistently benefits non-autoregressive models, with DisCo gaining more from distillation than CMLM under the same inference settings.
Decoding shows substantial wall-clock gains: the average number of iterations correlates with speedup, and DisCo achieves roughly a 4–5x reduction in iterations versus autoregressive baselines, depending on direction and setup.
DisCo with contextless keys/values can preserve performance even in autoregressive settings, suggesting broader applicability of the approach.
Training variants that more closely align training and inference (easy-first training) did not outperform random-sampling training, indicating random masking provides useful exploration.
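The fractional step counts reported above arise because parallel easy-first inference stops as soon as the predictions no longer change, so different sentences need different numbers of iterations. The sketch below illustrates that loop under stated assumptions: `easy_first_decode` and `toy_predict` are hypothetical names, the "model" is a hand-written toy rather than a transformer, and fixing the easy-first order after the first pass is an assumption of this sketch rather than a confirmed detail of the paper.

```python
def easy_first_decode(predict_fn, length, max_iters=10):
    """Toy parallel easy-first decoding loop.

    predict_fn(n, observed) returns (token, probability) for position n,
    where observed maps already-visible positions to their current tokens.
    """
    # Iteration 1: every position is predicted with no target context.
    first = [predict_fn(n, {}) for n in range(length)]
    tokens = [tok for tok, _ in first]
    probs = [p for _, p in first]

    # Rank positions from most to least confident ("easiest" first).
    order = sorted(range(length), key=lambda i: -probs[i])
    rank = {pos: r for r, pos in enumerate(order)}

    iters = 1
    while iters < max_iters:
        new_tokens = []
        for n in range(length):
            # Position n sees the previous predictions of all easier positions.
            observed = {i: tokens[i] for i in range(length) if rank[i] < rank[n]}
            tok, _ = predict_fn(n, observed)
            new_tokens.append(tok)
        iters += 1
        if new_tokens == tokens:  # converged: nothing changed this round
            break
        tokens = new_tokens
    return tokens, iters


REFERENCE = ["the", "cat", "sat", "down"]  # toy target (hypothetical)
BASE_PROB = [0.9, 0.8, 0.7, 0.6]


def toy_predict(n, observed):
    """Hand-written stand-in for a model: a position is only predicted
    correctly once the token to its left is visible and correct."""
    if n == 0:
        return REFERENCE[0], BASE_PROB[0]
    if observed.get(n - 1) == REFERENCE[n - 1]:
        return REFERENCE[n], 0.95
    return "<unk>", BASE_PROB[n]


decoded, used_iters = easy_first_decode(toy_predict, len(REFERENCE))
print(decoded, used_iters)
```

In this toy setup the loop ends before `max_iters` because one extra pass confirms that no token changed. A fixed-step scheme such as mask-predict with 10 steps would always run all 10 passes, which is the contrast behind the lower average iteration counts reported for easy-first inference.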

This review was generated by AI and checked by a human editor.