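[Paper Digest] Fast Structured Decoding for Sequence Models
This paper proposes a non-autoregressive translation model equipped with a CRF-based structured inference module (NART-CRF and NART-DCRF) that models co-occurrence among target words, achieving substantial speedups while approaching autoregressive accuracy.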
Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to reduce the inference time. However, these models assume that the decoding process of each token is conditionally independent of others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models could only achieve inferior accuracy compared to their autoregressive counterparts. To improve the decoding consistency and reduce the inference cost at the same time, we propose to incorporate a structured inference module into the non-autoregressive models. Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. Experiments in machine translation show that while increasing little latency (8~14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models.
Motivation and Objectives
- Reduce the inference latency of autoregressive sequence models without sacrificing accuracy.
- Integrate a structured inference module to capture multimodal target distributions in non-autoregressive decoding.
- Develop scalable CRF approximations suitable for large vocabularies in neural MT.
- Propose dynamic transitions to enrich CRF with positional context.
- Demonstrate state-of-the-art performance among non-autoregressive models on standard MT benchmarks.
Proposed Method
- Formulate non-autoregressive translation as sequence labeling and apply a linear-chain CRF to model adjacent-token dependencies.
- Use a simple NART decoder input (padding tokens followed by eos) to simplify architecture.
- Introduce a low-rank approximation for the CRF transition matrix using two transition embeddings (E1, E2) such that M = E1 E2^T.
- Apply beam approximation to reduce CRF decoding complexity from O(n|V|^2) to O(n k^2).
- Introduce a dynamic transition M^i = E1 M_dynamic^i E2^T where M_dynamic^i depends on adjacent decoder states, enriching positional context.
- Combine CRF loss with the vanilla NART loss in training: L = L_CRF + λ L_NAR (λ = 0.5).
- Evaluate on WMT14 En-De/De-En and IWSLT14 De-En with a Transformer teacher for distillation and rescoring.
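The low-rank transition and beam approximation listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dot`, `top_k`, `beam_viterbi`) and the toy scores are hypothetical, it uses plain Python lists instead of a tensor library, and it covers only the static transition M = E1 E2^T (the dynamic variant is omitted). At each position it keeps the k labels with the highest emission scores and runs Viterbi over those candidates only, so the cost per position is O(k^2) transition lookups rather than O(|V|^2).

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def beam_viterbi(emissions, E1, E2, k):
    """Viterbi decoding restricted to the top-k labels per position.

    emissions: n rows of |V| unary scores (assumed to come from the decoder).
    E1, E2:    |V| rows of d-dim transition embeddings; the transition score
               from label a to label b is dot(E1[a], E2[b]), i.e. M = E1 E2^T.
    Returns (best label sequence among the candidates, its total score).
    """
    cands = [top_k(row, k) for row in emissions]
    best = [emissions[0][c] for c in cands[0]]
    backptrs = []
    for i in range(1, len(emissions)):
        new_best, ptrs = [], []
        for c in cands[i]:
            # Score of reaching label c from each surviving previous label.
            scores = [best[p] + dot(E1[prev], E2[c])
                      for p, prev in enumerate(cands[i - 1])]
            p_star = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[p_star] + emissions[i][c])
            ptrs.append(p_star)
        best = new_best
        backptrs.append(ptrs)
    # Backtrace from the best final candidate.
    j = max(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [cands[-1][j]]
    for i in range(len(backptrs) - 1, -1, -1):
        j = backptrs[i][j]
        path.append(cands[i][j])
    path.reverse()
    return path, total

# Toy example (illustrative values): vocabulary of 3 labels, rank d = 2.
emissions = [[2.0, 1.0, 0.0], [1.0, 0.9, 0.0], [0.0, 1.0, 2.0]]
E1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
E2 = [[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # penalizes 0->0, rewards 0->1

path, score = beam_viterbi(emissions, E1, E2, k=2)
print(path)  # [0, 1, 2]
```

In this toy case, picking the highest-scoring label independently at each position would give [0, 0, 2]; the transition term penalizes the repeated label, so the CRF decode returns [0, 1, 2] instead. This is the kind of local inconsistency the structured module is meant to correct.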
Experimental Results
Research Questions
- RQ1: Can a CRF-based structured inference module improve decoding consistency and accuracy in non-autoregressive MT by modeling local label dependencies?
- RQ2: Do low-rank and beam approximations enable tractable CRF decoding for large vocabularies in NART without sacrificing performance?
- RQ3: Does dynamic CRF transition improve translation quality by incorporating positional context?
- RQ4: How close can NART-CRF/NART-DCRF approach autoregressive baselines in BLEU while maintaining speedups?
Key Findings
Values in parentheses give the BLEU gap to the autoregressive (ART) Transformer baseline.
| Model | En-De BLEU | De-En BLEU | IWSLT De-En BLEU | Latency (ms) | Speedup over ART |
|---|---|---|---|---|---|
| NART | 20.27 (7.14) | 22.02 (9.27) | 23.04 (10.22) | 26 | 14.9x |
| NART-CRF | 23.32 (4.09) | 25.75 (5.54) | 26.39 (6.87) | 35 | 11.1x |
| NART-CRF (rescoring 9) | 26.04 (1.37) | 28.88 (2.41) | 29.21 (4.05) | 60 | 6.45x |
| NART-CRF (rescoring 19) | 26.68 (0.73) | 29.26 (2.03) | 29.55 (3.71) | 87 | 4.45x |
| NART-DCRF | 23.44 (3.97) | 27.22 (4.07) | 27.44 (5.82) | 37 | 10.4x |
| NART-DCRF (rescoring 9) | 26.07 (1.34) | 29.68 (1.61) | 29.99 (3.27) | 63 | 6.14x |
| NART-DCRF (rescoring 19) | 26.80 (0.61) | 30.04 (1.25) | 30.36 (2.90) | 88 | 4.39x |
- NART-CRF/NART-DCRF significantly outperform prior non-autoregressive models across benchmarks.
- On WMT14 En-De, NART-DCRF with rescoring 19 achieves 26.80 BLEU, only 0.61 BLEU below the autoregressive Transformer in the reported setup.
- NART-CRF/NART-DCRF attain substantial speedups over ART (roughly 10-11x with CRF decoding alone; about 4.4x with rescoring 19).
- Beam size experiments show that k=16 already provides a strong approximation; larger k yields diminishing returns.
- Dynamic transitions provide consistent BLEU gains over the static CRF across all three tasks: without rescoring, +0.12 on En-De, +1.47 on De-En, and +1.05 on IWSLT De-En.
- NART-CRF/NART-DCRF with rescoring maintain strong accuracy while reducing latency compared to autoregressive models.
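The speedup column is a latency ratio, so each CRF row implies an ART baseline latency of latency x speedup. The short check below, a sketch whose variable and function names are illustrative, recomputes that implied baseline from the table values above. The ART latency itself is not stated in this digest, so the figure is an inference from the table, not a reported number.

```python
# (latency in ms, reported speedup over ART), copied from the table above.
rows = {
    "NART-CRF": (35, 11.1),
    "NART-CRF (rescoring 9)": (60, 6.45),
    "NART-CRF (rescoring 19)": (87, 4.45),
    "NART-DCRF": (37, 10.4),
    "NART-DCRF (rescoring 9)": (63, 6.14),
    "NART-DCRF (rescoring 19)": (88, 4.39),
}

def implied_baseline_ms(latency_ms, speedup):
    """ART latency implied by one row: speedup = ART latency / model latency."""
    return latency_ms * speedup

implied = {name: implied_baseline_ms(*v) for name, v in rows.items()}
for name, ms in implied.items():
    print(f"{name}: {ms:.1f} ms")
```

All six rows land within a few milliseconds of one another (roughly 385 to 389 ms), which suggests the speedups were computed against a single ART latency measurement and that small differences come from rounding.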
This digest was generated by AI and reviewed by human editors.