QUICK REVIEW

[Paper Review] Adaptive Input Representations for Neural Language Modeling

Alexei Baevski, Michael Auli|arXiv (Cornell University)|Sep 28, 2018

Topic Modeling | References: 29 | Cited by: 55

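One-Sentence Summary

The paper introduces adaptive input embeddings, which extend the adaptive softmax to input representations, and reports faster training and lower perplexity on the WikiText-103 and Billion Word benchmarks.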

ABSTRACT

We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units. We perform a systematic comparison of popular choices for a self-attentional architecture. Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters. On the WikiText-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the Billion Word benchmark, we achieve 23.02 perplexity.

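Research Motivation and Goals

Reduce overfitting and parameter count by varying the capacity of input embeddings according to word frequency.
Propose and implement adaptive input embeddings: partition the vocabulary into clusters by frequency, assign each cluster its own embedding dimension, and project to a common dimension before feeding the model.
Compare word-based, sub-word-based, and character-based factorizations of the input and output layers in a self-attentional architecture.
Evaluate training efficiency and perplexity improvements on the WikiText-103 and Billion Word datasets.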

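Proposed Method

Extend the adaptive softmax to input representations whose capacity varies per word cluster.
Partition the input vocabulary into clusters by frequency, assign each cluster its own embedding dimension, and project every cluster to a common dimension before it enters the model.
Optionally tie the input embeddings to the output embeddings of the adaptive softmax to reduce parameters further.
Systematically compare word, sub-word, and character inputs under different configurations in a Transformer-style decoder.
Train with Nesterov momentum, a cosine learning rate schedule, and a distributed multi-GPU setup; apply dropout regularization to the tail projections of the adaptive softmax.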
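The clustering and projection steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the cumulative cutoffs `[20000, 60000, 260000]`, the model dimension `1024`, the reduction factor `4`, and the choice to give every cluster (including the head) a projection are all assumptions made for the example. Token ids are assumed to be sorted by descending frequency.

```python
from bisect import bisect_right


def cluster_dims(d_model, n_clusters, factor):
    """Embedding width per cluster: d, d/factor, d/factor^2, ... (assumed scheme)."""
    return [d_model // (factor ** i) for i in range(n_clusters)]


def route(token_id, cutoffs):
    """Map a frequency-ranked token id to (cluster index, row inside that cluster).

    `cutoffs` are cumulative upper bounds, so cluster 0 holds ids [0, cutoffs[0]).
    """
    if not 0 <= token_id < cutoffs[-1]:
        raise ValueError("token id outside vocabulary")
    cluster = bisect_right(cutoffs, token_id)
    start = 0 if cluster == 0 else cutoffs[cluster - 1]
    return cluster, token_id - start


def adaptive_param_count(cutoffs, d_model, factor):
    """Parameters of the per-cluster embedding tables plus their projections to d_model."""
    dims = cluster_dims(d_model, len(cutoffs), factor)
    sizes = [hi - lo for lo, hi in zip([0] + cutoffs[:-1], cutoffs)]
    embed = sum(n * d for n, d in zip(sizes, dims))
    proj = sum(d * d_model for d in dims)  # one linear map per cluster (assumption)
    return embed + proj


def fixed_param_count(vocab_size, d_model):
    """Parameters of a single fixed-width embedding table, for comparison."""
    return vocab_size * d_model


if __name__ == "__main__":
    cutoffs = [20000, 60000, 260000]  # hypothetical cluster boundaries
    print(route(25000, cutoffs))
    print(adaptive_param_count(cutoffs, 1024, 4), fixed_param_count(260000, 1024))
```

Under these assumed settings, rare words get much narrower embedding rows than frequent ones, which is where the parameter savings come from; the projection matrices add only a small fixed cost per cluster.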

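Experimental Results

Research Questions

RQ1: Do adaptive input embeddings improve language model performance and training speed compared with fixed-size embeddings and character inputs?
RQ2: How do different input/output factorizations (words, sub-words, characters) affect perplexity and parameter efficiency?
RQ3: How does tying input and output embeddings in the adaptive setting affect performance and parameter count?
RQ4: How do the handling of rare versus frequent words and regularization affect model accuracy?
RQ5: How do context size and training block size affect the perplexity of large-scale models?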

Key Findings

Input	Output	Valid Perplexity	Test Perplexity	Train Time (hours)	Parameters
SM	Embedding+Softmax	23.87	24.92	57	476.8M
BPE	BPE Embedding+BPE Softmax	23.13	24.25	30	270M
BPE-T	BPE Embedding+BPE Softmax (tied)	22.46	23.45	30	235.7M
SM-T	Embedding+Softmax (tied)	22.63	23.38	56	339.7M
ASM	Embedding+Adaptive	21.23	22.18	35	263.1M
CNN	Char-CNN+Adaptive	20.86	21.79	70	266.3M
ADP	Adaptive+Adaptive	20.95	21.74	34	291.3M
ADP-T	Adaptive+Adaptive (tied)	19.79	20.51	30	246.9M

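Combined with the adaptive softmax, adaptive inputs reduce the number of input/output parameters by up to 61%.
Adaptive inputs train more than twice as fast as the character-input CNN baseline while reaching better accuracy.
On WikiText-103, the best model achieves a perplexity of 18.7, which is 10.5 lower than the previous best published result.
On Billion Word, the best model achieves a perplexity of 23.02, a substantial improvement over the prior state of the art.
Tying adaptive inputs and outputs (ADP-T) reaches the best perplexity while training about as fast as the compact sub-word models.
Regularizing rare words improves adaptive softmax performance on WikiText-103.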
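The speed and size comparisons can be checked directly against the table above. The snippet below is a small sketch that copies the test perplexity, training time, and total parameter count from each row; the dictionary layout and helper names are assumptions for illustration. Note that the parameter column counts the whole model, so the ratio it yields is not the 61% input/output reduction quoted above.

```python
# Rows copied from the table: name -> (test perplexity, train hours, parameters in millions)
results = {
    "SM": (24.92, 57, 476.8),
    "BPE": (24.25, 30, 270.0),
    "BPE-T": (23.45, 30, 235.7),
    "SM-T": (23.38, 56, 339.7),
    "ASM": (22.18, 35, 263.1),
    "CNN": (21.79, 70, 266.3),
    "ADP": (21.74, 34, 291.3),
    "ADP-T": (20.51, 30, 246.9),
}


def speedup(baseline, model):
    """Ratio of training hours: how many times faster `model` trains than `baseline`."""
    return results[baseline][1] / results[model][1]


def total_param_reduction(baseline, model):
    """Fractional reduction in total parameters relative to `baseline`."""
    return 1.0 - results[model][2] / results[baseline][2]


def best_by_test_perplexity():
    """Name of the row with the lowest test perplexity."""
    return min(results, key=lambda name: results[name][0])


if __name__ == "__main__":
    print(f"ADP vs CNN speedup: {speedup('CNN', 'ADP'):.2f}x")
    print(f"ADP-T vs CNN speedup: {speedup('CNN', 'ADP-T'):.2f}x")
    print(f"ADP-T vs SM total parameter reduction: {total_param_reduction('SM', 'ADP-T'):.1%}")
    print(f"Best test perplexity: {best_by_test_perplexity()}")
```

Both adaptive-input rows come out at a little over twice the training speed of the character CNN row, which matches the "more than twice as fast" claim.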


This review was generated by AI and checked by a human editor.