QUICK REVIEW

[Paper Review] Regularizing and Optimizing LSTM Language Models

Stephen Merity, Nitish Shirish Keskar|arXiv (Cornell University)|Aug 7, 2017

Topic Modeling|References: 31|Cited by: 468

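One-Sentence Summary

This paper introduces AWD-LSTM (a weight-dropped LSTM) and NT-ASGD to regularize and optimize LSTM language models, reaching state-of-the-art perplexity on Penn Treebank and WikiText-2, with a neural cache on top of the model lowering perplexity further.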

ABSTRACT

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.

Motivation and Objectives

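Provide effective regularization for over-parameterized RNNs without modifying the LSTM implementation.
Propose the weight-dropped LSTM, which applies DropConnect to the hidden-to-hidden weights as a form of recurrent regularization.
Study optimization strategies, NT-ASGD in particular, to improve the training of regularized LSTMs.
Explore additional regularization techniques (variable-length BPTT, embedding dropout, AR/TAR, weight tying) to improve data efficiency and generalization.
Evaluate on Penn Treebank (PTB) and WikiText-2 (WT2) to establish state-of-the-art perplexity and measure the gain from a neural cache.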
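
Variable-length BPTT, listed among the techniques above, can be illustrated with a small sampler. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the specific values (keeping the full base length with probability 0.95, a standard deviation of 5, a minimum length of 5, a base length of 70, and a learning rate of 30) do not appear in this review and should be treated as illustrative. The idea is to draw a different window length for each batch and rescale the learning rate in proportion, so that window boundaries do not fall at the same token positions in every epoch.

```python
import random

def sample_seq_len(base_bptt, rng, p_full=0.95, std=5.0, min_len=5):
    """Draw a BPTT window length (all defaults are illustrative assumptions)."""
    # Mostly use the base length, occasionally half of it.
    base = base_bptt if rng.random() < p_full else base_bptt / 2
    # Jitter around the chosen base and enforce a minimum length.
    return max(min_len, int(rng.gauss(base, std)))

def variable_length_windows(num_tokens, base_bptt, base_lr, seed=0):
    """Yield (start, seq_len, lr) triples that cover the token stream once."""
    rng = random.Random(seed)
    start = 0
    # The last token can only serve as a target, hence num_tokens - 1.
    while start < num_tokens - 1:
        seq_len = min(sample_seq_len(base_bptt, rng), num_tokens - 1 - start)
        # Rescale the learning rate so short windows are not over-weighted.
        yield start, seq_len, base_lr * seq_len / base_bptt
        start += seq_len

# Hypothetical corpus of 1000 tokens, base window 70, base learning rate 30.
windows = list(variable_length_windows(1000, 70, 30.0))
```

Each triple describes one training step: inputs are `tokens[start:start + seq_len]`, targets are the same slice shifted by one position, and `lr` is the step's rescaled learning rate.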

Proposed Method

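Introduce the weight-dropped LSTM, which applies DropConnect to the recurrent weight matrices and so regularizes the recurrent connections without modifying the LSTM's internal implementation.
Use NT-ASGD, a non-monotonically triggered variant of averaged SGD run with a constant learning rate, to improve training stability and performance.
Apply variable-length backpropagation through time (BPTT) during training to use the data more efficiently.
Apply embedding dropout and variational dropout to different parts of the model.
Tie the embedding and softmax weights to reduce the parameter count and improve generalization.
Apply activation regularization (AR) and temporal activation regularization (TAR) to the output of the final LSTM layer.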
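
The weight-dropping idea in the first item can be sketched in a few lines. The example below is a toy recurrence rather than an LSTM, and every name in it (`drop_connect`, `matvec`, the matrix `U`) is hypothetical. It assumes inverted-dropout scaling of the surviving weights, which this review does not specify. What it is meant to show is that the mask falls on the weights, not the activations, and that it is sampled once and then reused at every timestep, which is why the recurrent cell itself needs no changes.

```python
import math
import random

def drop_connect(weights, p, rng):
    """Return a copy of `weights` with each entry zeroed with probability p.

    Surviving entries are divided by (1 - p); this scaling is an assumption
    of the sketch, chosen so each weight keeps its expected value.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [[w / keep if rng.random() >= p else 0.0 for w in row]
            for row in weights]

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical 3x3 hidden-to-hidden matrix for a toy recurrent cell.
rng = random.Random(0)
U = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5],
     [-0.7, 0.4, 0.6]]

# Sample the mask once, before the sequence is processed ...
U_dropped = drop_connect(U, p=0.5, rng=rng)

# ... and reuse the same dropped matrix at every timestep.
h = [1.0, 0.0, -1.0]
for _ in range(4):
    h = [math.tanh(x) for x in matvec(U_dropped, h)]  # gates and inputs omitted
```

Because only the weight matrix is altered before the forward pass, the same approach could in principle wrap an optimized, black-box LSTM implementation.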

Experimental Results

Research Questions

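RQ1: Does recurrent regularization via DropConnect on the hidden-to-hidden weights, with no change to the LSTM implementation, improve the generalization of word-level language models?
RQ2: Does NT-ASGD offer practical and performance benefits over standard SGD/ASGD when training regularized LSTMs for language modeling?
RQ3: How do the additional regularization techniques (variable-length BPTT, embedding dropout, AR/TAR, weight tying) affect perplexity on PTB and WT2?
RQ4: How does a neural cache interact with AWD-LSTM to further reduce perplexity on PTB and WT2?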

Key Findings

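AWD-LSTM, built on a vanilla LSTM, achieves state-of-the-art word-level perplexity on Penn Treebank (57.3) and WikiText-2 (65.8).
Adding a neural cache on top of AWD-LSTM lowers perplexity further, to 52.8 (PTB) and 52.0 (WT2).
NT-ASGD, with its non-monotonic averaging trigger, outperforms plain SGD-based training for these regularized LSTMs.
The additional regularization techniques (variable-length BPTT, embedding dropout, AR/TAR, weight tying) yield notable improvements in perplexity and data utilization.
The weight-dropped LSTM (DropConnect on the recurrent weights) is the key component; removing it raises perplexity substantially, by up to 11 points.
Fine-tuning with ASGD after NT-ASGD brings additional gains; removing this step degrades performance.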
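
The non-monotonic trigger behind NT-ASGD can be made concrete with a short sketch. It assumes the rule is: at each validation check, start averaging if the current validation metric is worse than the best value recorded more than `n` checks ago. The function name, the parameter name `n`, and the validation numbers are hypothetical and chosen only for illustration.

```python
def nt_asgd_trigger(val_metrics, n):
    """Return the index of the validation check at which averaging would start.

    `val_metrics` holds one validation score per check (lower is better) and
    `n` is the non-monotone interval. Returns None if the trigger never fires.
    """
    logs = []
    for t, v in enumerate(val_metrics):
        # Compare against the best score seen more than n checks ago.
        if t > n and v > min(logs[:t - n]):
            return t
        logs.append(v)
    return None

# Made-up validation perplexities: improvement stalls after the fourth check.
history = [100.0, 90.0, 85.0, 84.0, 86.0, 87.0, 88.0]
switch_at = nt_asgd_trigger(history, n=2)  # → 5
```

In this example the score first worsens at check 4, but averaging starts only at check 5, once the score exceeds the best value from outside the most recent `n` checks. This tolerance for short-lived regressions is what makes the condition non-monotonic and removes the need to hand-tune the point at which averaging begins.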

This review was generated by AI and reviewed by human editors.