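[Paper Explained] MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning
MUSE introduces parallel multi-scale attention, combining self-attention, depthwise convolution, and a pointwise feed-forward network to better model global, local, and token-level context in sequence-to-sequence tasks, and reports state-of-the-art BLEU scores on mainstream translation datasets.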
In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention can model extremely long dependencies, the attention in deep layers tends to overconcentrate on a single token, leading to insufficient use of local information and difficulty in representing long sequences. In this work, we explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures. To this end, we propose the Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple. MUSE-simple contains the basic idea of parallel multi-scale sequence representation learning, and it encodes the sequence in parallel, in terms of different scales, with the help of self-attention and pointwise transformation. MUSE builds on MUSE-simple and explores combining convolution and self-attention for learning sequence representations from more scales. We focus on machine translation and the proposed approach achieves substantial performance improvements over Transformer, especially on long sequences. More importantly, we find that although conceptually simple, its success in practice requires intricate considerations, and the multi-scale attention must build on a unified semantic space. Under common settings, the proposed model achieves substantial performance and outperforms all previous models on three main machine translation tasks. In addition, MUSE has potential for accelerating inference due to its parallelism. Code will be available at https://github.com/lancopku/MUSE
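Motivation and Goals
- Motivation: Transformer-based sequence-to-sequence models need better long-sequence modeling than pure self-attention provides.
- Propose a parallel multi-scale architecture (MUSE) that fuses global (self-attention), local (convolution), and token-level (pointwise) representations.
- Demonstrate state-of-the-art BLEU on major translation benchmarks through empirical study, and analyze the factors that make multi-scale fusion effective.
- Show the computational advantage of parallelism, and offer insights into kernel selection and shared projections.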
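Proposed Method
- Define MUSE as an encoder/decoder built from N stacked MUSE modules with residual connections.
- Within each MUSE module, compute Attention(X), DepthConv(X), and Pointwise(X) in parallel and fuse them as X_i = X_{i-1} + Attention(X_{i-1}) + Conv(X_{i-1}) + Pointwise(X_{i-1}).
- Use depthwise separable convolution with dynamic kernel selection across several kernel sizes, and share the input projection with self-attention (V1 = V2 = V W^V).
- Provide MUSE-simple, a variant without convolution, to isolate the effect of the parallel multi-scale design.
- Train MUSE-base/Large on the large WMT datasets and MUSE-base on the smaller IWSLT datasets, using standard NMT evaluation settings.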
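The fusion rule X_i = X_{i-1} + Attention(X_{i-1}) + Conv(X_{i-1}) + Pointwise(X_{i-1}) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single attention head, omits layer normalization, output projections, and dropout, and models dynamic kernel selection as a softmax-weighted mix of depthwise convolutions, which is only one plausible reading of the description. All function and parameter names (`muse_block`, `init_params`, `alphas`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def add(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def self_attention(x, w_q, w_k, v):
    """Single-head scaled dot-product attention over the shared values v."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out

def depthwise_conv(v, kernel):
    """Zero-padded 1-D conv over time; kernel[c] holds the taps for channel c."""
    n, d, k = len(v), len(v[0]), len(kernel[0])
    pad = k // 2  # assumes odd kernel sizes
    out = [[0.0] * d for _ in range(n)]
    for t in range(n):
        for c in range(d):
            for j in range(k):
                s = t + j - pad
                if 0 <= s < n:
                    out[t][c] += kernel[c][j] * v[s][c]
    return out

def dynamic_conv(v, kernels, alphas):
    """Assumed form of kernel selection: softmax-weighted mix of kernel sizes."""
    gates = softmax(alphas)
    branches = [depthwise_conv(v, kern) for kern in kernels]
    return [[sum(g * b[t][c] for g, b in zip(gates, branches))
             for c in range(len(v[0]))] for t in range(len(v))]

def pointwise(x, w1, w2):
    """Token-level feed-forward transform: ReLU(X W1) W2."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def muse_block(x, p):
    v = matmul(x, p["w_v"])  # shared projection feeding attention and convolution
    attn = self_attention(x, p["w_q"], p["w_k"], v)
    conv = dynamic_conv(v, p["kernels"], p["alphas"])
    ffn = pointwise(x, p["w1"], p["w2"])
    return add(x, attn, conv, ffn)  # residual plus the three parallel branches

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(d, d_ff, kernel_sizes, seed=0):
    rng = random.Random(seed)
    return {
        "w_q": rand_matrix(d, d, rng), "w_k": rand_matrix(d, d, rng),
        "w_v": rand_matrix(d, d, rng),
        "w1": rand_matrix(d, d_ff, rng), "w2": rand_matrix(d_ff, d, rng),
        "kernels": [rand_matrix(d, k, rng) for k in kernel_sizes],
        "alphas": [0.0] * len(kernel_sizes),  # equal gate weights at start
    }

params = init_params(d=4, d_ff=8, kernel_sizes=[3, 7])
x = rand_matrix(5, 4, random.Random(1))  # 5 tokens, 4 channels
y = muse_block(x, params)
print(len(y), len(y[0]))  # same shape as the input: 5 4
```

Because the three branches read the same input and do not depend on one another, they could in principle be evaluated concurrently, which is the property the summary ties to faster inference. Dropping the `conv` term from `muse_block` gives a rough analogue of MUSE-simple.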
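Experimental Results
Research Questions
- RQ1: Can parallel multi-scale representations outperform pure self-attention or pure convolution models on sequence-to-sequence tasks?
- RQ2: Does sharing the projection between self-attention and convolution inside the multi-scale module help learning?
- RQ3: How does kernel size selection (dynamic vs. fixed) affect performance on long sequences?
- RQ4: How much practical inference speedup does parallelizing the MUSE module bring compared with Transformer?
- RQ5: Do these gains generalize across large-scale and small-scale translation datasets?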
Key Findings
| Model | En-De BLEU | En-Fr BLEU |
|---|---|---|
| ConvSeq2seq | 25.2 | 40.5 |
| SliceNet | 26.1 | - |
| Transformer | 28.4 | 41.0 |
| Weighted Transformer | 28.9 | 41.4 |
| Layer-wise Coordination | 29.1 | - |
| Transformer (relative position) | 29.2 | 41.5 |
| Transformer (Ott et al. 2018) | 29.3 | 43.2 |
| Evolved Transformer | 29.8 | 41.3 |
| DynamicConv | 29.7 | 43.2 |
| Local Joint Self-attention | 29.7 | 43.3 |
| MUSE-simple | 29.8 | 43.2 |
| MUSE | 29.9 | 43.5 |
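- MUSE-large reaches 29.9 BLEU on En-De and 43.5 BLEU on En-Fr, outperforming previous models of comparable size trained on comparable amounts of data.
- MUSE-simple already achieves strong results and comes close to the state of the art without any convolution; adding DepthConv improves it further.
- Sharing the projection between self-attention and convolution improves performance substantially (+1.4 BLEU over separate projections).
- Dynamically selected convolution kernels outperform fixed large or small kernels, and the best configuration achieves the highest BLEU on the evaluated tasks.
- With a comparable number of parameters, MUSE runs inference about 31% faster than Transformer.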
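To make the size of these gains concrete, the short snippet below computes BLEU differences from rows of the results table. The scores are copied from that table; the `gain` helper is a hypothetical name introduced only for this illustration.

```python
# BLEU scores copied from the results table, as (En-De, En-Fr).
bleu = {
    "Transformer": (28.4, 41.0),
    "Transformer (Ott et al. 2018)": (29.3, 43.2),
    "MUSE-simple": (29.8, 43.2),
    "MUSE": (29.9, 43.5),
}

def gain(model, baseline="Transformer"):
    """BLEU improvement of `model` over `baseline` for each language pair."""
    return tuple(round(m - b, 1) for m, b in zip(bleu[model], bleu[baseline]))

print(gain("MUSE"))                                   # (1.5, 2.5)
print(gain("MUSE", "Transformer (Ott et al. 2018)"))  # (0.6, 0.3)
print(gain("MUSE", "MUSE-simple"))                    # (0.1, 0.3)
```

The gap narrows considerably against the stronger Ott et al. (2018) Transformer row, and the convolution branch accounts for only a small part of the total gain over MUSE-simple in this table.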
This explainer was generated by AI and reviewed by a human editor.