QUICK REVIEW

[Paper Review] Pay Less Attention with Lightweight and Dynamic Convolutions

Felix Wu, Angela Fan|arXiv (Cornell University)|Jan 29, 2019

Natural Language Processing Techniques | Cited by 322

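One-Sentence Summary

This paper proposes lightweight and dynamic convolutions as efficient alternatives to self-attention for sequence modeling, achieving competitive or better results on translation, language modeling, and summarization with faster inference.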

ABSTRACT

Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.

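Motivation and Objectives

Reduce the reliance of sequence models on self-attention, whose cost grows quadratically with input length.
Propose lightweight convolutions that are depthwise separable and use softmax-normalized weights.
Introduce dynamic convolutions, which generate a separate kernel for each time step.
Evaluate on machine translation, language modeling, and abstractive summarization against self-attention baselines.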
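
The scaling claim in the abstract can be made concrete with a rough count. The sketch below is illustrative only: it counts how many (time step, context element) weights each mechanism applies, ignores projections and other constant factors, and uses a hypothetical kernel width `k = 31` that is not taken from this review.

```python
def attention_pair_count(n: int) -> int:
    """Full self-attention: each of n steps weighs all n context elements."""
    return n * n


def conv_pair_count(n: int, k: int) -> int:
    """Width-k convolution: each of n steps weighs at most k context elements."""
    return n * k


if __name__ == "__main__":
    k = 31  # hypothetical kernel width, for illustration only
    for n in (64, 256, 1024):
        print(n, attention_pair_count(n), conv_pair_count(n, k))
```

Doubling the input length doubles the convolution count but quadruples the self-attention count, which is the linear versus quadratic gap the paper refers to.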

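Proposed Method

Develop LightConv: a depthwise separable, softmax-normalized convolution with weights shared across channels and a fixed context window.
Introduce DynamicConv: a convolution whose kernel is predicted at every time step from the current input, so the weighting of the context varies over time.
Replace self-attention with LightConv or DynamicConv in a Transformer Big-style encoder-decoder architecture, using GLU-based blocks and residual connections.
Train with standard NLP objectives and hyperparameters on translation, language modeling, and summarization datasets.
Evaluate on WMT En-De, WMT En-Fr, IWSLT De-En, WMT Zh-En, Billion Word language modeling, and CNN-DailyMail summarization.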
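
The two convolution variants described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the centred window with zero padding, and the bias-free linear kernel predictor are choices made here for clarity, and the GLU blocks, residual connections, and any causal masking are left out.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]  # indexed as [time step][channel]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def _depthwise(x: Matrix, kernels_at: Callable[[int], Matrix]) -> Matrix:
    """Apply per-head kernels channel by channel.

    kernels_at(i) returns the [heads][k] softmax-normalised kernels for step i.
    Assumes a centred window with zero padding (an illustrative choice).
    """
    n, d = len(x), len(x[0])
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        kernels = kernels_at(i)
        heads, k = len(kernels), len(kernels[0])
        for c in range(d):
            kernel = kernels[c * heads // d]  # channels in a head share weights
            for j in range(k):
                t = i + j - k // 2
                if 0 <= t < n:
                    out[i][c] += kernel[j] * x[t][c]
    return out


def light_conv(x: Matrix, weights: Matrix) -> Matrix:
    """LightConv sketch: weights is [heads][k], fixed for every time step."""
    kernels = [softmax(row) for row in weights]
    return _depthwise(x, lambda i: kernels)


def dynamic_conv(x: Matrix, proj: List[Matrix]) -> Matrix:
    """DynamicConv sketch: proj is [heads][k][d], a hypothetical linear map
    that predicts the kernel for step i from x[i] alone."""
    def kernels_at(i: int) -> Matrix:
        return [
            softmax([sum(w * v for w, v in zip(row, x[i])) for row in head])
            for head in proj
        ]
    return _depthwise(x, kernels_at)


if __name__ == "__main__":
    x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    print(light_conv(x, [[0.0, 0.0, 0.0]]))       # uniform kernel: local average
    print(dynamic_conv(x, [[[0.0, 0.0]] * 3]))    # zero predictor: same result
```

With all-zero weights the softmax yields a uniform kernel, so both functions reduce to a local average. The only difference between them is where the kernel comes from: a fixed table for LightConv, and a function of the current time step for DynamicConv.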

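Experimental Results

Research Questions

RQ1: Can lightweight convolutions with fixed weights match or exceed self-attention on large-scale translation benchmarks?
RQ2: Do dynamic, time-step-dependent kernels bring further gains over fixed lightweight convolutions?
RQ3: Are lightweight and dynamic convolutions faster than self-attention at run time without sacrificing accuracy?
RQ4: Do these methods generalize well to language modeling and abstractive summarization?
RQ5: How do these methods scale with longer sequences and larger vocabularies?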

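Key Findings

LightConv achieves competitive BLEU on WMT En-De and En-Fr, trailing the state of the art by only 0.1 BLEU on En-Fr.
DynamicConv outperforms the best previously known result on WMT En-De by 0.4 BLEU, reaching 29.7 BLEU, and reaches state-of-the-art level on En-Fr.
On IWSLT De-En and WMT Zh-En, lightweight and dynamic convolutions match or outperform the self-attention baselines.
DynamicConv runs about 20% faster than a strong self-attention baseline while maintaining or improving accuracy.
For language modeling on the Billion Word corpus, DynamicConv performs on par with or better than the self-attention baseline.
On CNN-DailyMail summarization, both LightConv and DynamicConv outperform the self-attention baseline, with DynamicConv achieving the best ROUGE scores.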

This review was generated by AI and checked by a human editor.