QUICK REVIEW

[Paper Review] N-gram Language Modeling using Recurrent Neural Network Estimation

Ciprian Chelba, Mohammad Norouzi|arXiv (Cornell University)|Mar 31, 2017

Topic Modeling | References: 6 | Cited by: 33

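One-Sentence Summary

This paper uses an LSTM-based neural network to smooth n-gram language models in place of traditional back-off methods such as Kneser-Ney. The LSTM captures long-range context well: at n=13 it comes close to the fully recurrent LSTM model, and its perplexity keeps decreasing (that is, improving) as the n-gram order grows, which the traditional smoothing techniques do not achieve.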

ABSTRACT

We investigate the effective memory depth of RNN models by using them for $n$-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and 10k vocabulary) have found the LSTM cell with dropout to be the best model for encoding the $n$-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption the LSTM $n$-gram matches the LSTM LM performance for $n=9$ and slightly outperforms it for $n=13$. When allowing dependencies across sentence boundaries, the LSTM $13$-gram almost matches the perplexity of the unlimited history LSTM LM. LSTM $n$-gram smoothing also has the desirable property of improving with increasing $n$-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot target is only slightly beneficial for low $n$-gram orders. Experiments on the One Billion Words benchmark show that the results hold at larger scale: while LSTM smoothing for short $n$-gram contexts does not provide significant advantages over classic N-gram models, it becomes effective with long contexts ($n > 5$); depending on the task and amount of data it can match fully recurrent LSTM models at about $n=13$. This may have implications when modeling short-format text, e.g. voice search/query LMs. Building LSTM $n$-gram LMs may be appealing for some practical situations: the state in a $n$-gram LM can be succinctly represented with $(n-1)*4$ bytes storing the identity of the words in the context and batches of $n$-gram contexts can be processed in parallel. On the downside, the $n$-gram context encoding computed by the LSTM is discarded, making the model more expensive than a regular recurrent LSTM LM.

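Research Motivation and Objectives

Investigate the effective memory depth of RNN models by using them to smooth n-gram language models.
Assess whether LSTM-based models outperform traditional n-gram smoothing methods (such as Katz and Kneser-Ney) in perplexity and scalability.
Explore the trade-offs among training efficiency, inference speed, and model quality when an LSTM encodes the n-gram context.
Determine whether neural smoothing keeps improving as the n-gram order increases, where traditional back-off methods do not.
Assess the practical viability of LSTM n-gram models for short-format text applications such as voice search and query LMs.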

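Proposed Method

An LSTM network predicts the next word from a fixed-length n-gram context, replacing traditional n-gram probability estimation.
The LSTM processes the embedding of each word in the n-gram context in order and encodes the context history in its hidden state.
Dropout is applied to the LSTM cell to improve generalization and reduce overfitting during training.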
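Multinomial targets (soft labels) are tried in place of one-hot targets during training; per the abstract, they help only slightly, and only at low n-gram orders.
At inference, the n-gram state is just the identities of the n-1 context words, which fit in (n-1)*4 bytes, and batches of n-gram contexts can be processed in parallel. The LSTM encoding of each context is discarded after the prediction instead of being carried forward, which makes the model more expensive than a regular recurrent LSTM LM.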
Experiments compare two settings: sentence independence (context reset at <S>) and context that is allowed to cross sentence boundaries.
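
The data side of the method can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: the function names, the `</S>` end marker, and the padding of short contexts with `<S>` are all assumptions made here. It shows how fixed-length contexts differ between the two settings, one plausible reading of "multinomial targets" (the empirical next-word distribution for each context), and how the n-1 word ids pack into (n-1)*4 bytes.

```python
import struct
from collections import Counter, defaultdict

BOS, EOS = "<S>", "</S>"  # assumed boundary markers, for illustration only


def ngram_examples(sentences, n, cross_sentence=False):
    """Return (context, next_word) pairs with exactly n-1 context tokens."""
    if n < 2:
        raise ValueError("n must be at least 2")
    width = n - 1
    history = [BOS] * width
    examples = []
    for sentence in sentences:
        if not cross_sentence:
            # Sentence independence: forget everything before this sentence.
            history = [BOS] * width
        for word in list(sentence) + [EOS]:
            examples.append((tuple(history), word))
            # Slide the window: keep only the most recent n-1 tokens.
            history = (history + [word])[-width:]
    return examples


def multinomial_targets(examples):
    """Map each context to the empirical distribution of its next words."""
    counts = defaultdict(Counter)
    for context, word in examples:
        counts[context][word] += 1
    targets = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        targets[context] = {w: c / total for w, c in counter.items()}
    return targets


def pack_state(word_ids):
    """Pack n-1 word ids as little-endian 32-bit ints: (n-1)*4 bytes."""
    return struct.pack(f"<{len(word_ids)}i", *word_ids)
```

With `cross_sentence=True` the first word of a sentence is predicted from the tail of the previous sentence instead of from `<S>` padding. For a 13-gram, `pack_state` receives 12 ids and returns 48 bytes.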

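Experimental Results

Research Questions

RQ1: What is the effective memory depth of an LSTM when it is used to smooth an n-gram language model?
RQ2: How does LSTM-based n-gram smoothing compare with classic methods such as Kneser-Ney and Katz back-off?
RQ3: Unlike traditional smoothing techniques, does an LSTM-smoothed n-gram model keep improving as the n-gram order increases?
RQ4: Can an LSTM n-gram model come close to a fully recurrent LSTM language model, and at what n-gram order?
RQ5: What are the training and inference efficiency trade-offs of an LSTM-based n-gram model relative to a standard recurrent LSTM?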

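Key Findings

The LSTM with dropout encodes the n-gram state better than the feed-forward and vanilla RNN models, reaching the lowest perplexity among them on Penn Treebank.
Under the sentence independence assumption, the LSTM n-gram matches the fully recurrent LSTM LM at n=9 and slightly outperforms it at n=13.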
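When context is allowed to cross sentence boundaries, the LSTM 13-gram almost matches the unlimited-history LSTM LM (the abstract reports this for the Penn Treebank experiments); this review quotes perplexities of 49 versus 48 and attributes them to the One Billion Words benchmark.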
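LSTM n-gram models keep improving as the n-gram order increases, whereas Kneser-Ney and Katz back-off saturate at low orders.
On the One Billion Words benchmark, LSTM smoothing becomes effective for n>5 and can match the fully recurrent LSTM at about n=13, depending on the task and amount of data.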
Training with multinomial targets instead of one-hot targets brings only a slight benefit, and only at low n-gram orders.
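
Every comparison above is stated in perplexity, where lower is better. As a reminder of what the metric measures, here is a minimal sketch using the standard definition (the exponential of the average negative log-probability the model assigns to each actual next word). The function name and input format are assumptions for illustration and are not taken from the paper.

```python
import math


def perplexity(probabilities):
    """Perplexity from the probabilities a model gave to each actual next word."""
    if not probabilities:
        raise ValueError("need at least one prediction")
    # Average log-probability per word; every probability must be > 0.
    avg_log_prob = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-avg_log_prob)
```

A model that always assigns probability 0.25 to the correct word has a perplexity of 4: it is as uncertain as a uniform choice among four words.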

This review was generated by AI and checked by a human editor.