QUICK REVIEW

[Paper Review] Fast Structured Decoding for Sequence Models

Zhiqing Sun, Zhuohan Li|arXiv (Cornell University)|Oct 25, 2019

Algorithms and Data Compression | References: 27 | Cited by: 61

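One-Sentence Summary

This paper proposes a non-autoregressive translation model equipped with a CRF-based structured inference module (NART-CRF and NART-DCRF) that models co-occurrence among target words, reaching accuracy close to autoregressive models while delivering a significant speedup.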

ABSTRACT

Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to reduce the inference time. However, these models assume that the decoding process of each token is conditionally independent of others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models could only achieve inferior accuracy compared to their autoregressive counterparts. To improve the decoding consistency and reduce the inference cost at the same time, we propose to incorporate a structured inference module into the non-autoregressive models. Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. Experiments in machine translation show that while increasing little latency (8~14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models.

Motivation and Goals

Reduce the inference latency of autoregressive sequence models without sacrificing accuracy.
Integrate a structured inference module to capture multimodal target distributions in non-autoregressive decoding.
Develop scalable CRF approximations suitable for large vocabularies in neural MT.
Propose dynamic transitions to enrich CRF with positional context.
Demonstrate state-of-the-art performance among non-autoregressive models on standard MT benchmarks.

Proposed Method

Formulate non-autoregressive translation as sequence labeling and apply a linear-chain CRF to model adjacent-token dependencies.
Use a simple NART decoder input (a sequence of padding tokens followed by an eos token) to simplify the architecture.
Introduce a low-rank approximation for the CRF transition matrix using two transition embeddings (E1, E2) such that M = E1 E2^T.
Apply a beam approximation that keeps only the k highest-scoring candidate tokens at each position, reducing CRF decoding complexity from O(n|V|^2) to O(n k^2).
Introduce a dynamic transition M^i = E1 M_dynamic^i E2^T where M_dynamic^i depends on adjacent decoder states, enriching positional context.
Combine CRF loss with the vanilla NART loss in training: L = L_CRF + λ L_NAR (λ = 0.5).
Evaluate on WMT14 En-De/De-En and IWSLT14 De-En with a Transformer teacher for distillation and rescoring.
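The components listed above can be written compactly as follows. This is a sketch in the standard linear-chain CRF notation, not a transcription of the paper's equations: the label score s, the partition function Z(x), the transition rank d_t, and the decoder states h_i are notational assumptions introduced here, while E1, E2, M_dynamic^i, λ, and the loss come from the summary.

```latex
\begin{align}
% Linear-chain CRF over target tokens y_1..y_n given the source x
P(y \mid x) &= \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} s(y_i, x, i)
              + \sum_{i=2}^{n} M_{y_{i-1},\, y_i} \Big) \\
% Low-rank transition matrix (d_t is an assumed name for the rank)
M &= E_1 E_2^{\top}, \qquad E_1, E_2 \in \mathbb{R}^{|V| \times d_t} \\
% Dynamic transition: a position-specific d_t x d_t core computed from
% the adjacent decoder states h_{i-1}, h_i
M^{i} &= E_1 M^{i}_{\mathrm{dynamic}} E_2^{\top}, \qquad
        M^{i}_{\mathrm{dynamic}} \in \mathbb{R}^{d_t \times d_t} \\
% Training objective
\mathcal{L} &= \mathcal{L}_{\mathrm{CRF}} + \lambda\, \mathcal{L}_{\mathrm{NAR}},
        \qquad \lambda = 0.5
\end{align}
```

In the dynamic case, M^i replaces M in the pairwise term of the first line, so each position has its own transition scores while the |V| x d_t embeddings stay shared.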

Experimental Results

Research Questions

RQ1: Can a CRF-based structured inference module improve decoding consistency and accuracy in non-autoregressive MT by modeling local label dependencies?
RQ2: Do low-rank and beam approximations enable tractable CRF decoding for large vocabularies in NART without sacrificing performance?
RQ3: Does dynamic CRF transition improve translation quality by incorporating positional context?
RQ4: How close can NART-CRF/NART-DCRF approach autoregressive baselines in BLEU while maintaining speedups?
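To make the low-rank and beam approximations behind RQ2 concrete, here is a minimal NumPy sketch of Viterbi decoding restricted to the top-k tokens per position. It is an illustration under stated assumptions, not the authors' implementation: the function name `beam_viterbi`, the argument names, and the use of a static (non-dynamic) transition are all hypothetical choices made for this example.

```python
import numpy as np

def beam_viterbi(emissions, E1, E2, k):
    """Approximate CRF decoding (illustrative sketch, hypothetical API).

    emissions: (n, V) per-position label scores.
    E1, E2:    (V, d) transition embeddings, so that M = E1 @ E2.T.
    k:         number of candidate tokens kept per position (k <= V).
    Returns (token_ids, score) of the best path among the kept candidates.
    """
    n, _ = emissions.shape
    # Keep only the k highest-scoring tokens at each position.
    cand = np.argsort(-emissions, axis=1)[:, :k]         # (n, k) token ids
    unary = np.take_along_axis(emissions, cand, axis=1)  # (n, k) scores

    score = unary[0].copy()
    backptr = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # Only a k x k slice of M is ever formed: (k, d) @ (d, k).
        trans = E1[cand[i - 1]] @ E2[cand[i]].T
        total = score[:, None] + trans                   # (k_prev, k_cur)
        backptr[i] = total.argmax(axis=0)
        score = total.max(axis=0) + unary[i]

    # Follow the back-pointers from the best final candidate.
    idx = int(score.argmax())
    path = [idx]
    for i in range(n - 1, 0, -1):
        idx = int(backptr[i, idx])
        path.append(idx)
    path.reverse()
    tokens = [int(cand[i, j]) for i, j in enumerate(path)]
    return tokens, float(score.max())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, V, d = 6, 50, 8  # toy sizes, chosen only for the demo
    tokens, best = beam_viterbi(rng.normal(size=(n, V)),
                                rng.normal(size=(V, d)),
                                rng.normal(size=(V, d)), k=4)
    print(tokens, best)
```

With k equal to the vocabulary size the search is exact, and with k = 1 it reduces to picking the highest-scoring token at each position independently. Each step costs O(k^2 d) for the transition slice, so the full |V| x |V| matrix is never materialized.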

Key Findings

Model	En-De BLEU (gap to ART)	De-En BLEU (gap to ART)	IWSLT De-En BLEU (gap to ART)	Latency (ms)	Speedup over ART
NART	20.27 (7.14)	22.02 (9.27)	23.04 (10.22)	26	14.9x
NART-CRF	23.32 (4.09)	25.75 (5.54)	26.39 (6.87)	35	11.1x
NART-CRF (rescoring 9)	26.04 (1.37)	28.88 (2.41)	29.21 (4.05)	60	6.45x
NART-CRF (rescoring 19)	26.68 (0.73)	29.26 (2.03)	29.55 (3.71)	87	4.45x
NART-DCRF	23.44 (3.97)	27.22 (4.07)	27.44 (5.82)	37	10.4x
NART-DCRF (rescoring 9)	26.07 (1.34)	29.68 (1.61)	29.99 (3.27)	63	6.14x
NART-DCRF (rescoring 19)	26.80 (0.61)	30.04 (1.25)	30.36 (2.90)	88	4.39x
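The latency and speedup columns can be cross-checked with simple arithmetic: multiplying a row's latency by its speedup should recover roughly the same autoregressive latency every time. The short sketch below does this for the CRF and DCRF rows. The numbers are copied from the table, the helper name `implied_art_latency` is hypothetical, and the resulting baseline is an inference from the table rather than a figure stated in this review.

```python
# (latency in ms, reported speedup over ART), copied from the table.
rows = {
    "NART-CRF": (35, 11.1),
    "NART-CRF (rescoring 9)": (60, 6.45),
    "NART-CRF (rescoring 19)": (87, 4.45),
    "NART-DCRF": (37, 10.4),
    "NART-DCRF (rescoring 9)": (63, 6.14),
    "NART-DCRF (rescoring 19)": (88, 4.39),
}

def implied_art_latency(latency_ms, speedup):
    """Autoregressive latency implied by one row: latency x speedup."""
    return latency_ms * speedup

implied = {name: implied_art_latency(ms, sp) for name, (ms, sp) in rows.items()}
mean_art_ms = sum(implied.values()) / len(implied)

if __name__ == "__main__":
    for name, value in implied.items():
        print(f"{name}: ~{value:.0f} ms")
    print(f"mean implied ART latency: ~{mean_art_ms:.0f} ms")
```

All six rows imply an autoregressive latency in the mid-380s of milliseconds, so the reported speedups are mutually consistent up to rounding.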

NART-CRF/NART-DCRF significantly outperform prior non-autoregressive models across benchmarks.
On WMT14 En-De, NART-DCRF with rescoring 19 achieves 26.80 BLEU, only 0.61 BLEU below the autoregressive Transformer in the reported setup.
NART-CRF/NART-DCRF attain substantial speedups over ART (10.4x to 11.1x without rescoring; about 4.4x with rescoring 19).
Beam size experiments show that k=16 already provides a strong approximation; larger k yields diminishing returns.
Dynamic transitions provide BLEU gains on all three tasks; without rescoring, NART-DCRF improves over NART-CRF by 0.12 BLEU on En-De, 1.47 on De-En, and 1.05 on IWSLT De-En.
NART-CRF/NART-DCRF with rescoring maintain strong accuracy while reducing latency compared to autoregressive models.

This review was generated by AI and reviewed by human editors.