QUICK REVIEW

[Paper Review] Incorporating BERT into Parallel Sequence Decoding with Adapters

Junliang Guo, Zhirui Zhang|arXiv (Cornell University)|Oct 13, 2020

Topic Modeling | References: 38 | Cited by: 40

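One-Sentence Summary

This paper proposes AB-Net, a framework that inserts lightweight adapters into two BERT models (one source-side, one target-side) and combines them with Mask-Predict for parallel sequence decoding, achieving strong NMT performance with halved decoding latency and a parameter-efficient design.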

ABSTRACT

While large scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and task agnostic. Our framework is based on a parallel sequence decoding algorithm named Mask-Predict considering the bi-directional and conditional independent nature of BERT, and can be adapted to traditional autoregressive decoding easily. We conduct extensive experiments on neural machine translation tasks where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves $36.49$/$33.57$ BLEU scores on IWSLT14 German-English/WMT14 German-English translation. When adapted to autoregressive decoding, the proposed method achieves $30.60$/$43.56$ BLEU scores on WMT14 English-German/English-French translation, on par with the state-of-the-art baseline models.

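Research Motivation and Goals

Use two pre-trained BERT models as the encoder and decoder, with lightweight adapters, in a seq2seq framework.
Mitigate catastrophic forgetting by freezing the BERT parameters and training only the adapters.
Apply a parallel decoding scheme (Mask-Predict) to exploit BERT's bidirectional context while preserving conditional generation.
Demonstrate gains over autoregressive baselines across multiple translation tasks and multilingual settings.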
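
The freeze-and-adapt idea in the goals above can be illustrated with a minimal sketch, assuming a generic bottleneck adapter (down-projection, ReLU, up-projection, residual connection). The review does not specify the adapter's internal structure, so the shape, the zero initialization of the up-projection, and the names `make_adapter`, `adapter_forward`, and `count_params` are illustrative assumptions rather than the authors' implementation. The frozen BERT layer is represented only by its output vector `h`, which the sketch never modifies.

```python
import random


def make_adapter(hidden, bottleneck, seed=0):
    """Build a hypothetical bottleneck adapter as plain nested lists."""
    rng = random.Random(seed)
    return {
        # Down-projection: hidden -> bottleneck, small random weights.
        "down": [[rng.gauss(0.0, 0.02) for _ in range(hidden)]
                 for _ in range(bottleneck)],
        # Up-projection: bottleneck -> hidden, zero-initialized (an assumption)
        # so that the adapter starts out as an identity mapping.
        "up": [[0.0] * bottleneck for _ in range(hidden)],
    }


def adapter_forward(adapter, h):
    """Return h + up(relu(down(h))) for one hidden vector h."""
    z = [max(0.0, sum(w * x for w, x in zip(row, h)))
         for row in adapter["down"]]
    delta = [sum(w * v for w, v in zip(row, z)) for row in adapter["up"]]
    # Residual connection: the frozen layer's output passes through unchanged
    # plus a learned correction.
    return [x + d for x, d in zip(h, delta)]


def count_params(adapter):
    """Count the adapter weights, i.e. the only parameters that would be trained."""
    return sum(len(row) for matrix in adapter.values() for row in matrix)


adapter = make_adapter(hidden=8, bottleneck=2)
h = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 2.0, 0.1]  # stand-in for a frozen BERT layer output
out = adapter_forward(adapter, h)
n_trainable = count_params(adapter)
```

Under these assumptions, an adapter with hidden size 768 and bottleneck size 64 would add 2 × 768 × 64 = 98,304 weights (biases omitted), which is small next to a full BERT layer; both sizes are hypothetical and not taken from the paper.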

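Proposed Method

Insert adapter modules into every BERT layer on both the encoder and decoder sides, and fine-tune only the adapters.
Use two BERT models (source-side Xbert and target-side Ybert) as the encoder and decoder in a seq2seq setting.
Train with the conditional masked language modeling objective $L(y^m \mid y^r, x; A_{enc}, A_{dec})$, similar to Equation (3) of the paper, where $y^m$ denotes the masked target tokens and $y^r$ the remaining unmasked ones.
Adopt Mask-Predict parallel decoding to exploit BERT's bidirectional context and enable fast inference; optionally extend to autoregressive decoding.
Predict the target length through a special [LENGTH] token, then decode by iterative refinement, alternating masking and prediction.
Optionally vary the adapter structure (Aenc, Adec) and the layers in which adapters are placed to trade off performance against parameter efficiency.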
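
The iterative mask-and-predict decoding described above can be sketched as follows. This is a simplified illustration of the general Mask-Predict procedure, not the AB-Net code: the target length is passed in directly instead of being predicted from a [LENGTH] token, and `toy_predict` is a hypothetical stand-in for the adapter-equipped BERT decoder that returns one `(token, probability)` pair per position.

```python
MASK = "[MASK]"


def mask_predict(predict, length, iterations):
    """Decode `length` tokens in parallel over a fixed number of refinement passes."""
    tokens = [MASK] * length
    probs = [0.0] * length
    for t in range(iterations):
        # Predict all positions at once, but only fill the masked ones;
        # tokens kept from earlier passes retain their stored confidence.
        preds = predict(tokens)
        for i, (tok, p) in enumerate(preds):
            if tokens[i] == MASK:
                tokens[i], probs[i] = tok, p
        # Linearly decay the number of tokens to re-mask for the next pass.
        n_mask = length * (iterations - (t + 1)) // iterations
        if n_mask == 0:
            break
        # Re-mask the least confident positions.
        worst = sorted(range(length), key=lambda i: probs[i])[:n_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens


def toy_predict(tokens):
    """Hypothetical model: unsure about odd positions until some context is visible."""
    reference = ["a", "b", "c", "d"]
    visible = sum(tok != MASK for tok in tokens)
    out = []
    for i, ref in enumerate(reference):
        if visible == 0 and i % 2 == 1:
            out.append(("?", 0.1))   # low-confidence wrong guess
        else:
            out.append((ref, 0.9))   # confident correct guess
    return out


one_pass = mask_predict(toy_predict, length=4, iterations=1)
two_pass = mask_predict(toy_predict, length=4, iterations=2)
```

In this toy setup a single pass leaves the low-confidence guesses in place, while a second pass re-masks them and repairs them using the tokens that are now visible. That bidirectional conditioning on already-decoded tokens is what the method relies on BERT for.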

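Experimental Results

Research Questions

RQ1: Can BERT serve as both the encoder and the decoder of a seq2seq framework by means of adapters?
RQ2: Does training only the adapter modules while freezing the BERT layers mitigate catastrophic forgetting and improve efficiency?
RQ3: Does parallel decoding with Mask-Predict deliver a speedup while remaining competitive with autoregressive baselines in translation quality?
RQ4: How do the size and architecture of the adapters affect performance and training efficiency?
RQ5: Is the framework effective across multiple language pairs and resource settings?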

Key Findings

Model	De-En (IWSLT14)	Ro-En (IWSLT14)	En-De (WMT16)	De-En (WMT14)	Latency	Parameters
Transformer-Base	33.59	34.46	28.04	32.69	778 ms	74 M
Mask-Predict	31.71	33.31	27.03	30.53	161 ms	75 M
BERT-Fused NAT	33.14	34.12	27.73	32.10	260 ms	90 M
AB-Net	36.49	35.63	28.69	33.57	327 ms	67 M
AB-Net-Enc	34.45	-	28.08	-	165 ms	78 M

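With parallel decoding, AB-Net reaches 36.49 BLEU on IWSLT14 De-En and 33.57 BLEU on WMT14 De-En, outperforming both Mask-Predict and the autoregressive baseline.
With a comparable number of trainable parameters, AB-Net cuts decoding latency by roughly half relative to Transformer-Base (327 ms vs. 778 ms).
The two-sided version of AB-Net (BERT as both encoder and decoder) uses fewer trainable parameters than BERT-Fused NAT while reaching higher BLEU than the baselines.
Adapters on both the encoder and decoder sides let the model draw on the information in both BERT models and model the conditional dependency, which improves performance.
AB-Net-Enc (BERT with adapters on the encoder side only) also achieves strong results; placing adapters only in the top layers maintains performance with fewer parameters.
On low-resource IWSLT14 language pairs (En-It, It-En, En-Es, Es-En, En-Nl, Nl-En), AB-Net consistently outperforms the baselines.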
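
As a quick check on the latency claim, the speedups implied by the table can be recomputed directly. The sketch below only restates the latency column from the table above; the variable names are illustrative.

```python
# Decoding latency in milliseconds, copied from the results table.
latency_ms = {
    "Transformer-Base": 778,
    "Mask-Predict": 161,
    "BERT-Fused NAT": 260,
    "AB-Net": 327,
    "AB-Net-Enc": 165,
}

baseline = latency_ms["Transformer-Base"]

# Speedup of each model relative to the autoregressive Transformer-Base.
speedup = {name: baseline / ms for name, ms in latency_ms.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

By these figures AB-Net is about 2.4 times faster than Transformer-Base, consistent with the "latency reduced by half" summary, while remaining slower than plain Mask-Predict.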


This review was generated by AI and reviewed by a human editor.