QUICK REVIEW

[Paper Digest] Towards Making the Most of BERT in Neural Machine Translation

Jiacheng Yang, Mingxuan Wang|arXiv (Cornell University)|Aug 15, 2019

Topic Modeling | References: 25 | Cited by: 31

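One-Sentence Summary

This paper proposes CTNMT, a concerted training framework that integrates BERT into neural machine translation (NMT) by combining asymptotic distillation, a dynamic switching gate, and a rate-scheduled learning strategy to mitigate catastrophic forgetting. The method reaches state-of-the-art performance, improving BLEU by up to 3.0 on the WMT14 English-German benchmark and surpassing the previous state-of-the-art pre-training-aided NMT system by 1.4 BLEU on the same benchmark.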

ABSTRACT

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTNMT) that is the key to integrate the pre-trained LMs to neural machine translation (NMT). Our proposed CTNMT consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previous pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show CTNMT gains of up to 3 BLEU score on the WMT14 English-German language pair which even surpasses the previous state-of-the-art pre-training aided NMT by 1.4 BLEU score. While for the large WMT14 English-French task with 40 millions of sentence-pairs, our base model still significantly improves upon the state-of-the-art Transformer big model by more than 1 BLEU score. The code and model can be downloaded from https://github.com/bytedance/neurst/tree/master/examples/ctnmt.

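Research Motivation and Objectives

Address the catastrophic forgetting that occurs when BERT is fine-tuned in resource-rich neural machine translation (NMT) settings.
Overcome the limitations of integrating BERT directly into NMT, which typically yields no improvement on large-scale benchmarks such as WMT14.
Develop a unified framework that combines the knowledge of a pre-trained language model with the sequence-to-sequence learning ability of NMT.
Improve NMT performance through joint training that preserves BERT's general knowledge while adapting to the translation task.
Demonstrate the effectiveness of the proposed method on large-scale, high-resource translation datasets such as WMT14 English-French and English-Chinese.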

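Proposed Method

Apply asymptotic distillation to transfer knowledge from BERT to the NMT encoder by minimizing an L2 or cross-entropy loss between the hidden representations of the teacher model (pre-trained BERT) and the student model (NMT encoder).
Introduce a dynamic switching gate that adaptively fuses the BERT-encoded representation with the NMT encoder output through an input-dependent attention mechanism, giving context-aware feature fusion.
Adopt a rate-scheduled learning strategy that controls the learning rates of the BERT and NMT components separately, preventing overfitting and preserving pre-trained knowledge.
Train the NMT model end to end with all three components (distillation, dynamic gating, and scheduled learning) without adding extra parameters.
Use the last layer of BERT as the initial encoder representation while letting the NMT encoder learn task-specific features during joint training.
Optimize the joint model with a multi-task objective that combines the NMT loss and the distillation loss, so that both translation quality and knowledge retention are covered.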
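The three components listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-dimension (diagonal) gate weights `w`, `u`, `b`, the mixing weight `alpha`, and the triangular shape of the learning-rate multiplier are all hypothetical choices made for the example. The digest only states that the loss combines an NMT term with a distillation term, that the gate fuses the two representations adaptively, and that BERT and NMT parameters follow separate learning-rate schedules.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l2_distillation_loss(h_bert, h_nmt):
    """Mean squared distance between teacher (BERT) and student (NMT encoder)
    hidden states for one token. Both arguments are equal-length float lists."""
    return sum((a - b) ** 2 for a, b in zip(h_bert, h_nmt)) / len(h_bert)


def combined_loss(nmt_loss, kd_loss, alpha=0.9):
    """Multi-task objective: weighted sum of translation and distillation losses.
    The weight alpha is an assumed hyperparameter for this sketch."""
    return alpha * nmt_loss + (1.0 - alpha) * kd_loss


def switch_gate(h_bert, h_nmt, w, u, b):
    """Sigmoid gate that interpolates between the two representations.
    Assumption: per-dimension scalar weights instead of full matrices.
    g = sigmoid(w * h_bert + u * h_nmt + b); output = g * h_bert + (1 - g) * h_nmt
    """
    fused = []
    for hb, hn, wi, ui, bi in zip(h_bert, h_nmt, w, u, b):
        g = sigmoid(wi * hb + ui * hn + bi)
        fused.append(g * hb + (1.0 - g) * hn)
    return fused


def bert_lr_scale(step, warmup_steps, decay_end):
    """Assumed triangular multiplier for the BERT learning rate: ramps linearly
    from 0 to 1 over warmup_steps, decays linearly back to 0 by decay_end,
    then stays at 0 so the BERT parameters are effectively frozen."""
    if step < warmup_steps:
        return step / warmup_steps
    if step < decay_end:
        return 1.0 - (step - warmup_steps) / (decay_end - warmup_steps)
    return 0.0


if __name__ == "__main__":
    kd = l2_distillation_loss([1.0, 2.0], [1.0, 4.0])
    print("distillation loss:", kd)
    print("combined loss:", combined_loss(nmt_loss=1.0, kd_loss=kd))
    print("fused state:", switch_gate([2.0], [4.0], [0.0], [0.0], [0.0]))
    print("BERT lr scale at step 5:", bert_lr_scale(5, 10, 20))
```

In a real training loop the multiplier from `bert_lr_scale` would scale only the learning rate of the BERT parameter group, while the NMT parameters keep their own schedule. With zero gate weights and bias the gate value is 0.5, so the fused state is the plain average of the two inputs; learned weights let the model shift toward either source per dimension.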

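Experimental Results

Research Questions

RQ1: Can a pre-trained BERT be fine-tuned effectively in a high-resource NMT setting without catastrophic forgetting?
RQ2: How can BERT's contextual representations and NMT's sequence-modeling strengths be optimized jointly for translation?
RQ3: Which training strategy is most effective at preserving BERT's pre-trained knowledge while adapting to the NMT task?
RQ4: Does dynamic fusion of BERT and NMT encoder features outperform fixed fusion or direct embedding replacement?
RQ5: Does giving the BERT and NMT components separately scheduled learning rates improve convergence speed and final performance compared with uniform fine-tuning?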

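Key Findings

On the WMT14 English-German benchmark, CTNMT improves BLEU by up to 3.0 and surpasses the previous state-of-the-art pre-training-aided NMT system by 1.4 BLEU.
On the large WMT14 English-French dataset, which contains 40 million sentence pairs, the CTNMT base model outperforms the state-of-the-art Transformer-big model by more than 1.0 BLEU.
On the WMT14 English-Chinese benchmark, CTNMT achieves a gain of 1.6 BLEU, indicating consistent improvements across language pairs.
Asymptotic distillation preserves BERT's pre-trained knowledge, as shown by stable performance during fine-tuning.
The dynamic switching gate yields better representation fusion, especially on sentences where BERT or NMT alone performs poorly.
The rate-scheduled learning strategy decouples the update speeds of the BERT and NMT components, which improves both convergence and final performance.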

This digest was generated by AI and reviewed by human editors.