QUICK REVIEW

[Paper Review] Fast Structured Decoding for Sequence Models

Zhiqing Sun, Zhuohan Li | arXiv (Cornell University) | Oct 25, 2019
Algorithms and Data Compression | References: 27 | Cited by: 61
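One-Sentence Summary

This paper proposes non-autoregressive translation models equipped with a CRF-based structured inference module (NART-CRF and NART-DCRF) that models co-occurrence between target words, achieving large speedups while approaching autoregressive accuracy.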

ABSTRACT

Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to reduce the inference time. However, these models assume that the decoding process of each token is conditionally independent of others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models could only achieve inferior accuracy compared to their autoregressive counterparts. To improve the decoding consistency and reduce the inference cost at the same time, we propose to incorporate a structured inference module into the non-autoregressive models. Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. Experiments in machine translation show that while increasing little latency (8~14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models.
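
The three modeling assumptions contrasted in the abstract can be written side by side. This is a sketch in generic notation rather than the paper's exact formulation: the symbols s (per-position label score), t (transition score between adjacent labels), and Z(x) (normalizer) are assumed names, and conditioning on the predicted target length n is omitted for brevity.

```latex
% Autoregressive: each token depends on all previously generated tokens.
P_{\mathrm{AR}}(y \mid x) = \prod_{i=1}^{n} p(y_i \mid y_{<i}, x)

% Non-autoregressive: tokens are conditionally independent given the source.
P_{\mathrm{NAR}}(y \mid x) = \prod_{i=1}^{n} p(y_i \mid x)

% Linear-chain CRF: adjacent tokens are coupled through a transition term.
P_{\mathrm{CRF}}(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{i=1}^{n} s(y_i, x, i) + \sum_{i=2}^{n} t(y_{i-1}, y_i, x, i) \Big)
```

The second sum is what restores dependence between neighboring output tokens while still allowing all scores to be computed in parallel.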

Motivation and Objectives

  • Reduce the inference latency of autoregressive sequence models without sacrificing accuracy.
  • Integrate a structured inference module to capture multimodal target distributions in non-autoregressive decoding.
  • Develop scalable CRF approximations suitable for large vocabularies in neural MT.
  • Propose dynamic transitions to enrich CRF with positional context.
  • Demonstrate state-of-the-art performance among non-autoregressive models on standard MT benchmarks.

Proposed Method

  • Formulate non-autoregressive translation as sequence labeling and apply a linear-chain CRF to model adjacent-token dependencies.
  • Use a simple NART decoder input (padding tokens followed by eos) to simplify architecture.
  • Introduce a low-rank approximation for the CRF transition matrix using two transition embeddings (E1, E2) such that M = E1 E2^T.
  • Apply beam approximation to reduce CRF decoding complexity from O(n|V|^2) to O(n k^2).
  • Introduce a dynamic transition M^i = E1 M_dynamic^i E2^T where M_dynamic^i depends on adjacent decoder states, enriching positional context.
  • Combine CRF loss with the vanilla NART loss in training: L = L_CRF + λ L_NAR (λ = 0.5).
  • Evaluate on WMT14 En-De/De-En and IWSLT14 De-En with a Transformer teacher for distillation and rescoring.
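
The low-rank transition and the beam approximation listed above can be sketched together in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dot`, `topk`, `beam_viterbi`) and the toy numbers are hypothetical, and candidates are assumed to be chosen per position by their label (emission) score. It covers decoding only, not the training-time normalizer.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def topk(scores, k):
    """Indices of the k largest entries, best first."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def beam_viterbi(emissions, E1, E2, k):
    """Approximate linear-chain CRF decoding in O(n * k^2) transition lookups.

    emissions: n lists of |V| label scores, one list per target position.
    E1, E2:    |V| x d transition embeddings, so M[a][b] = dot(E1[a], E2[b]).
    k:         number of candidate labels kept per position (assumed top-k
               by emission score).
    """
    cands = [topk(row, k) for row in emissions]
    # best[j]: score of the best path ending in candidate j at this position
    best = [emissions[0][c] for c in cands[0]]
    back = []
    for i in range(1, len(emissions)):
        new_best, ptr = [], []
        for c in cands[i]:
            # Only k x k transition entries are evaluated per step; the full
            # |V| x |V| matrix is never materialised.
            scores = [best[p] + dot(E1[prev], E2[c])
                      for p, prev in enumerate(cands[i - 1])]
            p_star = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[p_star] + emissions[i][c])
            ptr.append(p_star)
        best = new_best
        back.append(ptr)
    # Backtrack from the best final candidate.
    j = max(range(len(best)), key=best.__getitem__)
    score = best[j]
    path = [cands[-1][j]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        path.append(cands[i][j])
    path.reverse()
    return path, score


# Toy example: 3 positions, vocabulary of 3 labels, rank-2 transitions.
emissions = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5], [0.4, 0.6, 0.7]]
E1 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
E2 = [[0.2, 0.1], [-0.3, 0.4], [0.6, -0.2]]
path, score = beam_viterbi(emissions, E1, E2, k=2)
```

With k equal to the vocabulary size the procedure reduces to exact Viterbi decoding; with k = 1 it degenerates to the per-position argmax of a plain non-autoregressive decoder.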

Experimental Results

Research Questions

  • RQ1: Can a CRF-based structured inference module improve decoding consistency and accuracy in non-autoregressive MT by modeling local label dependencies?
  • RQ2: Do low-rank and beam approximations enable tractable CRF decoding for large vocabularies in NART without sacrificing performance?
  • RQ3: Does dynamic CRF transition improve translation quality by incorporating positional context?
  • RQ4: How close can NART-CRF/NART-DCRF approach autoregressive baselines in BLEU while maintaining speedups?
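
The dynamic transition that RQ3 asks about replaces the fixed product E1 E2^T with a position-dependent M^i = E1 M_dynamic^i E2^T. A minimal sketch follows, assuming M_dynamic^i is produced from the two adjacent decoder states. The single tanh layer in `dynamic_core` is a hypothetical stand-in for whatever network the paper uses, and all names and numbers here are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dynamic_core(h_prev, h_cur, W, bias, d):
    """Map the concatenated states [h_{i-1}; h_i] to a d x d matrix.

    Hypothetical single tanh layer; W has d*d rows, one per output entry.
    """
    x = h_prev + h_cur  # list concatenation
    flat = [math.tanh(dot(row, x) + b) for row, b in zip(W, bias)]
    return [flat[r * d:(r + 1) * d] for r in range(d)]


def transition_score(prev_label, label, E1, E2, M_dyn):
    """Entry (prev_label, label) of E1 M_dyn E2^T, computed as e1^T M_dyn e2
    so the |V| x |V| matrix is never formed."""
    e1, e2 = E1[prev_label], E2[label]
    return sum(e1[r] * dot(M_dyn[r], e2) for r in range(len(e1)))


# Toy example: rank d = 2, decoder states of size 2, vocabulary of 3 labels.
d = 2
E1 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
E2 = [[0.2, 0.1], [-0.3, 0.4], [0.6, -0.2]]
W = [[0.1, -0.2, 0.3, 0.0],
     [0.0, 0.5, -0.1, 0.2],
     [0.4, 0.1, 0.0, -0.3],
     [-0.2, 0.0, 0.2, 0.1]]
bias = [0.0, 0.0, 0.0, 0.0]

# Two different pairs of adjacent states yield two different d x d cores,
# hence different transition scores for the same label pair.
M_a = dynamic_core([1.0, 0.0], [0.0, 1.0], W, bias, d)
M_b = dynamic_core([0.0, 1.0], [1.0, 0.0], W, bias, d)
score_a = transition_score(0, 2, E1, E2, M_a)
score_b = transition_score(0, 2, E1, E2, M_b)
```

Setting the core to the identity matrix recovers the static low-rank transition E1 E2^T, which is one way to see the dynamic variant as a strict generalization.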

Key Findings

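Values in parentheses are the BLEU gap to the autoregressive (ART) Transformer.

| Model | En-De BLEU | De-En BLEU | IWSLT De-En BLEU | Latency (ms) | Speedup vs. ART |
|---|---|---|---|---|---|
| NART | 20.27 (7.14) | 22.02 (9.27) | 23.04 (10.22) | 26 | 11.1x |
| NART-CRF | 23.32 (4.09) | 25.75 (5.54) | 26.39 (6.87) | 35 | 11.1x |
| NART-CRF (rescoring 9) | 26.04 (1.37) | 28.88 (2.41) | 29.21 (4.05) | 60 | 6.45x |
| NART-CRF (rescoring 19) | 26.68 (0.73) | 29.26 (2.03) | 29.55 (3.71) | 87 | 4.45x |
| NART-DCRF | 23.44 (3.97) | 27.22 (4.07) | 27.44 (5.82) | 37 | 10.4x |
| NART-DCRF (rescoring 9) | 26.07 (1.34) | 29.68 (1.61) | 29.99 (3.27) | 63 | 6.14x |
| NART-DCRF (rescoring 19) | 26.80 (0.61) | 30.04 (1.25) | 30.36 (2.90) | 88 | 4.39x |

CRF beam size ablation (k varies): varies with k
Rescoring impact (9)
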
  • NART-CRF/NART-DCRF significantly outperform prior non-autoregressive models across benchmarks.
  • On WMT14 En-De, NART-DCRF with rescoring over 19 candidates achieves 26.80 BLEU, 0.61 BLEU below the autoregressive Transformer in the reported setup.
  • NART-CRF and NART-DCRF attain substantial speedups over ART: roughly 10x to 11x without rescoring, and about 4.4x when rescoring 19 candidates.
  • Beam size experiments show that k=16 already provides a strong approximation; larger k yields diminishing returns.
  • Dynamic transitions improve BLEU in every reported setting; without rescoring, NART-DCRF gains +0.12 (En-De), +1.47 (De-En), and +1.05 (IWSLT De-En) BLEU over NART-CRF.
  • NART-CRF/NART-DCRF with rescoring maintain strong accuracy while reducing latency compared to autoregressive models.
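
As a quick sanity check on the speed/accuracy trade-off, the latency and speedup figures reported for the CRF variants can be multiplied to recover the autoregressive latency they imply. The numbers are copied from the results listed above; the dictionary name and layout are only illustrative.

```python
# (latency in ms, speedup over ART) for each CRF variant in the results
crf_rows = {
    "NART-CRF": (35, 11.1),
    "NART-CRF (rescoring 9)": (60, 6.45),
    "NART-CRF (rescoring 19)": (87, 4.45),
    "NART-DCRF": (37, 10.4),
    "NART-DCRF (rescoring 9)": (63, 6.14),
    "NART-DCRF (rescoring 19)": (88, 4.39),
}

# latency * speedup should give roughly the same ART latency for every row
implied_art_ms = {name: lat * sp for name, (lat, sp) in crf_rows.items()}

for name, ms in implied_art_ms.items():
    print(f"{name}: implied ART latency of about {ms:.0f} ms")
```

All six rows point to an autoregressive latency in the high 380s of milliseconds, so the extra 9 to 11 ms that the CRF layer adds over plain NART is small relative to the baseline it replaces.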

This review was generated by AI and reviewed by human editors.