QUICK REVIEW

[Paper Review] Lite Transformer with Long-Short Range Attention

Zhanghao Wu, Zhijian Liu|arXiv (Cornell University)|Apr 24, 2020

Topic Modeling | Cited by 130

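One-Sentence Summary

Lite Transformer introduces a two-branch Long-Short Range Attention (LSRA) that models local and global context separately, delivering mobile-friendly NLP under constrained compute while outperforming the Transformer on BLEU. It also achieves a substantial reduction in model size and outperforms an AutoML-searched baseline without the heavy cost of architecture search.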

ABSTRACT

Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires enormous amount of computations to achieve high performance, which makes it not suitable for mobile applications that are tightly constrained by the hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of transformer base model by 2.5x with 0.3 BLEU score degradation. Combining with pruning and quantization, we further compressed the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years. Code has been made available at https://github.com/mit-han-lab/lite-transformer.

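Motivation and Objectives

Enable efficient NLP inference on edge devices under strict compute constraints.
Design a lightweight Transformer architecture that maintains or improves performance within a 500M Mult-Adds budget.
Introduce LSRA, which replaces bottlenecked attention with specialized local and global branches.
Show that the LSRA-based model can be compressed further with pruning and quantization to reach a substantial size reduction.
Compare performance and cost against an AutoML-based baseline (Evolved Transformer) in the mobile setting.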
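
To make a Mult-Adds budget such as the one above more concrete, the following is a minimal back-of-the-envelope sketch of how multiply-adds split between self-attention and the FFN in one standard Transformer layer. The function names, the 4x FFN expansion, and the example values (30 tokens, model width 512) are assumptions chosen for illustration; the counts ignore biases, softmax, and layer normalization, and are not the paper's exact accounting.

```python
def attention_mult_adds(n_tokens: int, d_model: int) -> dict:
    """Rough multiply-add counts for one self-attention layer."""
    # Four dense projections (Q, K, V, output), each n_tokens x d_model x d_model.
    projections = 4 * n_tokens * d_model * d_model
    # QK^T scores plus the weighted sum over V, each n_tokens x n_tokens x d_model.
    mixing = 2 * n_tokens * n_tokens * d_model
    return {"projections": projections, "mixing": mixing}


def ffn_mult_adds(n_tokens: int, d_model: int, expansion: int = 4) -> int:
    """Rough multiply-add count for a two-layer FFN with hidden width expansion * d_model."""
    return 2 * n_tokens * d_model * (expansion * d_model)


if __name__ == "__main__":
    n, d = 30, 512  # illustrative values, assumed for this sketch
    attn = attention_mult_adds(n, d)
    print("attention projections:", attn["projections"])
    print("attention mixing:     ", attn["mixing"])
    print("FFN:                  ", ffn_mult_adds(n, d))
```

With these assumed values the FFN costs twice as much as the attention projections, and the token-mixing part of attention is a small fraction of the total. This is the kind of imbalance that makes the channel bottleneck around attention worth revisiting for short sequences.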

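Proposed Method

Propose Long-Short Range Attention (LSRA) with two parallel branches: a global attention branch and a local convolution branch.
Split the input along the channel dimension to feed the two branches, then fuse their outputs in the following FFN, which effectively halves the computation of each branch.
Flatten the channel dimension to replace the conventional bottleneck in the Transformer block, shifting more of the model's capacity toward attention.
Use a lightweight convolution module (depthwise-style, parameter-efficient) in the local branch to capture local context.
Train and evaluate Lite Transformer on MT (IWSLT, WMT) and on additional tasks (abstractive summarization, language modeling) under mobile compute budgets (≤500M Mult-Adds).
Compare against the Transformer baseline and the Evolved Transformer, and analyze compression with pruning and quantization.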
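
The two-branch structure described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single attention head with identity Q/K/V projections, a plain depthwise 1-D convolution with odd-sized kernels and zero padding in place of the paper's lightweight convolution module, no masking, and it leaves out the FFN that mixes the two halves afterwards. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(x):
    """Single-head scaled dot-product self-attention (identity projections, assumed)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out


def local_depthwise_conv(x, kernels):
    """Depthwise 1-D convolution over positions: one odd-sized kernel per channel."""
    n, d = len(x), len(x[0])
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * d for _ in range(n)]
    for t in range(n):
        for c in range(d):
            acc = 0.0
            for j in range(k):
                src = t + j - pad
                if 0 <= src < n:  # positions outside the sequence act as zeros
                    acc += kernels[c][j] * x[src][c]
            out[t][c] = acc
    return out


def lsra(x, kernels):
    """Split channels in half, run one branch on each half, concatenate the results."""
    half = len(x[0]) // 2
    global_in = [row[:half] for row in x]
    local_in = [row[half:] for row in x]
    global_out = global_attention(global_in)
    local_out = local_depthwise_conv(local_in, kernels)
    return [g + l for g, l in zip(global_out, local_out)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0, 1.0, 2.0], [0.0, 1.0, 3.0, 4.0], [1.0, 1.0, 5.0, 6.0]]
    smoothing = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]  # one kernel per local channel
    for row in lsra(tokens, smoothing):
        print(row)
```

Because each branch only sees half of the channels, the attention branch and the convolution branch each operate on a narrower input than a full-width layer would, which is where the per-branch savings come from.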

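Experimental Results

Research Questions

RQ1: Under mobile resource constraints, can LSRA improve the efficiency of Transformer-based models without sacrificing performance on MT and other language tasks?
RQ2: Under comparable compute budgets, how does Lite Transformer perform on MT, summarization, and language modeling relative to the standard Transformer and an AutoML-based baseline?
RQ3: How does combining Lite Transformer with standard compression techniques (pruning, quantization) affect model size and performance?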

Key Findings

In the mobile setting, Lite Transformer improves BLEU over the Transformer on WMT'14 En-Fr by +1.2 at 500M Mult-Adds and by +1.7 at 100M Mult-Adds; it also improves over the Transformer on WMT En-De under the same budgets.
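On IWSLT De-En, Lite Transformer outperforms the Transformer baseline by about 1.6 BLEU at around 100M Mult-Adds.
Compared with the Transformer baseline, Lite Transformer reduces computation by about 2.4x on CNN-DailyMail summarization, and in language modeling at around 500M Mult-Adds it reaches a perplexity about 1.8 points lower.
Combined with pruning and 8-bit quantization, the model size can be compressed by 18.2x with negligible BLEU loss on WMT En-Fr.
Compared with the AutoML-based Evolved Transformer, Lite Transformer scores 0.5 BLEU higher on WMT En-De in the mobile setting, without the large search cost (more than 250 GPU years, plus the associated CO2 emissions).
Overall, LSRA's specialization into global and local context improves the efficiency and scalability of mobile NLP while matching or exceeding baseline performance.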
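
The compression finding above combines two standard techniques. The sketch below shows one common form of each, magnitude pruning and symmetric 8-bit linear quantization, on a flat list of weights. It is an illustration under assumptions, not the paper's pipeline: the function names are hypothetical, the size estimate assumes a 32-bit baseline, and it ignores the storage needed for sparse indices and quantization scales.

```python
def magnitude_prune(weights, keep_ratio):
    """Zero out all but the largest-magnitude weights (ties may keep a few extra)."""
    k = max(1, round(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_8bit(weights):
    """Symmetric linear quantization to integers in [-127, 127]; returns (ints, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float weights."""
    return [q * scale for q in quantized]


def idealized_compression_ratio(keep_ratio, bits_before=32, bits_after=8):
    """Idealized size ratio: bit-width reduction divided by the fraction of weights kept."""
    return (bits_before / bits_after) / keep_ratio


if __name__ == "__main__":
    weights = [0.5, -0.1, 0.02, -0.9, 0.3, 0.0, 0.7, -0.05]
    pruned = magnitude_prune(weights, keep_ratio=0.5)
    quantized, scale = quantize_8bit(pruned)
    print("pruned:   ", pruned)
    print("quantized:", quantized)
    print("restored: ", dequantize(quantized, scale))
    print("idealized ratio:", idealized_compression_ratio(0.5))
```

In this idealized accounting, going from 32-bit to 8-bit weights contributes a factor of 4 and pruning contributes the rest; real compressed formats add overhead, so measured ratios will differ from this estimate.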

This review was generated by AI and reviewed by a human editor.