QUICK REVIEW

[Paper Review] Improving Neural Language Modeling via Adversarial Training

Dilin Wang, Chengyue Gong|arXiv (Cornell University)|Jun 10, 2019

Natural Language Processing Techniques | Cited by 55

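One-Sentence Summary

Introduces adversarial MLE training, which adds adversarial perturbations to the softmax output word embeddings to improve generalization; it achieves new single-model state-of-the-art perplexities on PTB and WT2 and raises BLEU scores for Transformer-based machine translation.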

ABSTRACT

Recently, substantial progress has been made in language modeling by using deep neural networks. However, in practice, large scale neural language models have been shown to be prone to overfitting. In this paper, we present a simple yet highly effective adversarial training mechanism for regularizing neural language models. The idea is to introduce adversarial noise to the output embedding layer while training the models. We show that the optimal adversarial noise yields a simple closed-form solution, thus allowing us to develop a simple and time efficient algorithm. Theoretically, we show that our adversarial mechanism effectively encourages the diversity of the embedding vectors, helping to increase the robustness of models. Empirically, we show that our method improves on the single model state-of-the-art results for language modeling on Penn Treebank (PTB) and Wikitext-2, achieving test perplexity scores of 46.01 and 38.07, respectively. When applied to machine translation, our method improves over various transformer-based translation baselines in BLEU scores on the WMT14 English-German and IWSLT14 German-English tasks.

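Motivation and Objectives

Motivates regularization as a remedy for overfitting in large-scale neural language models.
Proposes a simple adversarial training mechanism that targets the softmax output embeddings.
Derives a closed-form solution for the optimal adversarial perturbation, which enables a fast training algorithm.
Shows theoretically that the method encourages diversity among the embeddings and improves robustness.
Validates the improvements empirically on language modeling benchmarks (PTB, WT2, WT103) and machine translation (WMT14 En-De, IWSLT14 De-En).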

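Proposed Method

Formulates adversarial MLE as a max-min problem: maximize the log-likelihood over the model parameters while minimizing it over adversarial perturbations applied to the output embeddings (Eq. 5).
Computes the optimal perturbation for each target word in closed form, delta_i* = -epsilon h / ||h||, where h is the context (hidden) vector; this yields AdvSoft_epsilon, in which the target word's logit is shifted by -epsilon ||h|| (Eqs. 6–7).
Updates (theta, w) iteratively by standard gradient ascent on the adversarial objective, with delta set by the closed-form solution.
Ties the input and output embedding weights and uses common training techniques; epsilon is set adaptively as epsilon = alpha * ||w_i||, where alpha is a hyperparameter.
Provides theoretical insight showing that the adversarial mechanism enforces diversity among the output embeddings (epsilon-recognizability, distance separation).
Evaluates on PTB, WT2, and WT103 for language modeling and on WMT2014 En-De and IWSLT2014 De-En for translation, using baseline architectures such as AWD-LSTM and Transformer with the softmax replaced by AdvSoft.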
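
The closed-form step described above can be illustrated with a small sketch. This is not the authors' implementation: it is a pure-Python toy that assumes a single context vector h, a dense output embedding table W, and a hypothetical function name `adv_softmax_nll`. It only shows that applying delta* = -epsilon h / ||h|| to the target embedding is the same as lowering the target logit by epsilon ||h||, with epsilon = alpha * ||w_target||.

```python
import math


def adv_softmax_nll(h, W, target, alpha=0.0):
    """Negative log-likelihood of `target` under an adversarially shifted softmax.

    h:      context (hidden) vector, a list of floats
    W:      output embeddings, one row per vocabulary word
    target: index of the target word
    alpha:  illustrative scale, epsilon = alpha * ||W[target]||
    """
    h_norm = math.sqrt(sum(x * x for x in h))
    w_norm = math.sqrt(sum(x * x for x in W[target]))
    epsilon = alpha * w_norm

    # Plain logits w_k^T h for every vocabulary word.
    logits = [sum(w_k * h_k for w_k, h_k in zip(w, h)) for w in W]

    # Worst-case perturbation of the target embedding, in closed form:
    # (w_t - epsilon * h / ||h||)^T h = w_t^T h - epsilon * ||h||.
    logits[target] -= epsilon * h_norm

    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Toy usage: the adversarial loss is never smaller than the clean loss.
h = [1.0, 2.0]
W = [[0.5, 0.1], [0.2, 0.3], [-0.1, 0.4]]
clean_loss = adv_softmax_nll(h, W, target=0, alpha=0.0)
adv_loss = adv_softmax_nll(h, W, target=0, alpha=0.1)
```

With alpha = 0 the function reduces to the ordinary softmax cross-entropy, so the adversarial term adds no parameters; in a real training loop the same shift would be applied to batched logits inside an autodiff framework.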

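Experimental Results

Research Questions

RQ1: Do adversarial perturbations of the output embeddings improve the generalization of neural language models?
RQ2: Can the closed-form adversarial perturbation provide simple and efficient regularization without introducing extra parameters?
RQ3: Does the method promote embedding diversity and robustness, and how does it affect perplexity and BLEU across benchmarks?
RQ4: How does adversarial MLE perform when applied to both language modeling and neural machine translation tasks?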

Key Findings

Dataset	Model	Parameters	Validation perplexity	Test perplexity
Penn Treebank (PTB)	AWD-LSTM + Ours	24M	57.15	55.01
Penn Treebank (PTB)	AWD-LSTM + MoS + Ours	22M	54.98	52.87
Penn Treebank (PTB)	AWD-LSTM + MoS + Partial Shuffled + Ours	22M	46.63	46.01
Wikitext-2 (WT2)	AWD-LSTM + Ours	24M	49.31	48.72
Wikitext-2 (WT2)	AWD-LSTM + MoS + Ours	22M	47.15	46.52
Wikitext-2 (WT2)	AWD-LSTM + MoS + Partial Shuffled + Ours	22M	–	38.07
Wikitext-103 (WT103)	4-layer QRNN (baseline)	–	32.0	33.0
Wikitext-103 (WT103)	4-layer QRNN + Ours	–	30.6	31.6
Wikitext-103 (WT103)	4-layer QRNN + Ours + Dynamic Eval	–	27.2	28.0

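Achieves new single-model state-of-the-art test perplexities on PTB (46.01) and, per the abstract, on WT2 (38.07).
On WT103, improves over the QRNN baseline and reaches a test perplexity of 28.0 with dynamic evaluation.
In translation, the Transformer baselines gain BLEU (En→De: 28.43/29.52; De→En: 33.61/35.18, for the Small/Base configurations).
Combining the adversarial softmax with the AWD-LSTM, MoS, and Partial Shuffled variants outperforms the corresponding baselines on PTB and WT2.
Embedding diversity improves: nearest-neighbor distances increase and the singular values of the embedding matrix are more evenly distributed; the models show less overfitting in the PTB/WT2 experiments.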
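
One of the diversity indicators mentioned above, the distance from each embedding to its nearest neighbor, is easy to compute on a small table. The sketch below is only an illustration under assumptions: the function name `nearest_neighbor_distances` and the toy matrix are hypothetical, Euclidean distance is assumed, and the paper's exact measurement protocol may differ.

```python
import math


def nearest_neighbor_distances(E):
    """For each row of E, return the Euclidean distance to its closest other row.

    E must contain at least two embedding vectors of equal length.
    """
    distances = []
    for i, e_i in enumerate(E):
        closest = min(math.dist(e_i, e_j) for j, e_j in enumerate(E) if j != i)
        distances.append(closest)
    return distances


# Toy embedding table with three 2-dimensional vectors.
E = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
nn_dists = nearest_neighbor_distances(E)
mean_nn_dist = sum(nn_dists) / len(nn_dists)
```

A larger mean nearest-neighbor distance suggests the embeddings are more spread out. The loop is quadratic in vocabulary size, so a real vocabulary would call for a vectorized or approximate nearest-neighbor search.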

This review was generated by AI and reviewed by a human editor.