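[Paper Explained] Insertion Transformer: Flexible Sequence Generation via Insertion Operations
The Insertion Transformer generates sequences by inserting tokens at arbitrary positions. This supports both fully autoregressive and parallel insertion decoding, and it achieves competitive BLEU on WMT14 En-De with a near-logarithmic number of decoding iterations.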
We present the Insertion Transformer, an iterative, partially autoregressive model for sequence generation based on insertion operations. Unlike typical autoregressive models which rely on a fixed, often left-to-right ordering of the output, our approach accommodates arbitrary orderings by allowing for tokens to be inserted anywhere in the sequence during decoding. This flexibility confers a number of advantages: for instance, not only can our model be trained to follow specific orderings such as left-to-right generation or a binary tree traversal, but it can also be trained to maximize entropy over all valid insertions for robustness. In addition, our model seamlessly accommodates both fully autoregressive generation (one insertion at a time) and partially autoregressive generation (simultaneous insertions at multiple locations). We validate our approach by analyzing its performance on the WMT 2014 English-German machine translation task under various settings for training and decoding. We find that the Insertion Transformer outperforms many prior non-autoregressive approaches to translation at comparable or better levels of parallelism, and successfully recovers the performance of the original Transformer while requiring only logarithmically many iterations during decoding.
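Motivation and Goals
- Motivate flexible sequence generation that goes beyond left-to-right autoregression.
- Introduce an insertion-based decoding framework that can insert tokens at any position in the current canvas.
- Show that the model supports both sequential and parallel (multi-location) insertion and can be trained end to end.
- Evaluate on WMT 2014 English-German to compare against autoregressive and non-autoregressive baselines.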
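Proposed Method
- Model p(c, l | x, ŷ), the probability of inserting content c at location l in the current canvas ŷ (Equation 1).
- Modify the Transformer decoder to produce slot representations for all insertion locations by adding marker tokens and concatenating adjacent decoder outputs.
- Explore content-location distributions (joint p(c, l) or factorized p(c | l) p(l)) and a contextualized vocabulary bias.
- Study training orders via left-to-right, balanced binary tree, and uniform (maximum-entropy) losses, under different termination schemes (slot finalization vs sequence finalization).
- Describe the autoregressive (one insertion at a time) and parallel (multiple insertions per step) decoding procedures.
- Address training-time challenges caused by non-unidirectional state updates (decoder states are recomputed after each insertion) and by the variance introduced by sampling generation steps.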
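The insertion step and the soft binary-tree target described above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: `apply_insertions` and `soft_binary_tree_weights` are hypothetical helper names, slot `l` is taken to mean "just before `canvas[l]`", and the weighting is assumed to be a softmax over negative distance from the span center with temperature `tau`, which is one way to realize a balanced-binary-tree target (a very large `tau` approaches the uniform loss).

```python
import math
from typing import List, Tuple


def apply_insertions(canvas: List[str], insertions: List[Tuple[str, int]]) -> List[str]:
    """Apply (token, slot) insertions to a canvas in a single step.

    Slot l sits just before canvas[l]; slot len(canvas) is the right end.
    Hypothetical helper: assumes at most one token per slot per step,
    so one pair models sequential decoding and many pairs model parallel decoding.
    """
    by_slot = {slot: token for token, slot in insertions}
    if len(by_slot) != len(insertions):
        raise ValueError("at most one insertion per slot in a single step")
    if any(not 0 <= slot <= len(canvas) for slot in by_slot):
        raise ValueError("slot index out of range")
    out: List[str] = []
    for slot in range(len(canvas) + 1):
        if slot in by_slot:
            out.append(by_slot[slot])  # new token goes before canvas[slot]
        if slot < len(canvas):
            out.append(canvas[slot])
    return out


def soft_binary_tree_weights(span_len: int, tau: float) -> List[float]:
    """Target weights over the tokens still missing from one slot.

    Assumption: weight is a softmax of -|center - i| / tau, so tokens near
    the middle of the missing span are preferred; large tau flattens the
    weights toward a uniform target.
    """
    if span_len <= 0 or tau <= 0:
        raise ValueError("span_len and tau must be positive")
    center = (span_len - 1) / 2
    scores = [-abs(center - i) / tau for i in range(span_len)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, inserting `A`, `C`, and `E` into slots 0, 1, and 2 of the canvas `["B", "D"]` in one parallel step yields `["A", "B", "C", "D", "E"]`.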
Experimental Results
Research Questions
- RQ1: Can insertion-based generation match Transformer-level quality with substantially fewer decoding iterations under parallel decoding?
- RQ2: How do different generation orders (left-to-right, balanced binary tree, uniform) affect learning signals and decoding efficiency?
- RQ3: What termination and loss strategies (slot finalization vs sequence finalization) optimize BLEU and convergence?
- RQ4: How do architectural variants (joint vs conditional content-location, contextual vocabulary bias, mixture-of-softmaxes) impact performance?
Key Findings
- Insertion-based generation can match Transformer-level quality with substantially fewer decoding iterations under parallel decoding.
- On WMT 2014 En–De, baseline greedy decoding with binary-tree loss achieves development BLEU around 21.02, improving with EOS penalties and distillation.
- With knowledge distillation from a Transformer teacher, BLEU improvements of about 3–4 points were observed across baselines.
- The best model using binary-tree loss with distillation and EOS tuning reaches a development BLEU of 25.80; parallel decoding can achieve comparable or slightly better BLEU (e.g., 27.41 on development with parallel binary-tree).
- Parallel decoding yields comparable or higher BLEU than greedy decoding for several configurations, demonstrating that generation in roughly log2 n iterations is practical for typical sequence lengths.
- On the newstest2014 test set, the Insertion Transformer with parallel decoding reaches BLEU near 27.4, competitive with autoregressive and non-autoregressive baselines while requiring far fewer iterations.
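To see where the roughly log2 n iteration count comes from, the sketch below simulates an idealized parallel decoder. It assumes an oracle that, in every round, fills each open slot with the middle token of that slot's missing span, which is the best case for a balanced binary tree order. The function name `count_parallel_rounds` is hypothetical, and the count excludes any final step used only to signal termination.

```python
from typing import List, Tuple


def count_parallel_rounds(n: int) -> int:
    """Count decoding rounds needed to emit n tokens when every open slot
    receives the middle token of its missing span in each round.

    Idealized oracle simulation, not a trained model.
    """
    # Each span is a half-open range [lo, hi) of target positions not yet emitted.
    spans: List[Tuple[int, int]] = [(0, n)] if n > 0 else []
    rounds = 0
    while spans:
        next_spans: List[Tuple[int, int]] = []
        for lo, hi in spans:
            mid = (lo + hi) // 2  # position emitted for this slot this round
            if lo < mid:
                next_spans.append((lo, mid))  # tokens still missing on the left
            if mid + 1 < hi:
                next_spans.append((mid + 1, hi))  # tokens still missing on the right
        spans = next_spans
        rounds += 1
    return rounds
```

Under these assumptions the round count equals floor(log2 n) + 1 for n ≥ 1, so a 32-token output needs 6 insertion rounds instead of 32 sequential steps.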
This explanation was generated by AI and reviewed by a human editor.