[Paper Explained] Depthwise Separable Convolutions for Neural Machine Translation
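Introduces SliceNet, a convolutional sequence-to-sequence model for neural machine translation that uses depthwise separable and super-separable convolutions and achieves state-of-the-art results with fewer parameters and without dilation.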
Depthwise separable convolutions reduce the number of parameters and computation used in convolutional operations while increasing representational efficiency. They have been shown to be successful in image classification models, both in obtaining better models than previously possible for a given parameter count (the Xception architecture) and considerably reducing the number of parameters required to perform at a given level (the MobileNets family of architectures). Recently, convolutional sequence-to-sequence networks have been applied to machine translation tasks with good results. In this work, we study how depthwise separable convolutions can be applied to neural machine translation. We introduce a new architecture inspired by Xception and ByteNet, called SliceNet, which enables a significant reduction of the parameter count and amount of computation needed to obtain results like ByteNet, and, with a similar parameter count, achieves new state-of-the-art results. In addition to showing that depthwise separable convolutions perform well for machine translation, we investigate the architectural changes that they enable: we observe that thanks to depthwise separability, we can increase the length of convolution windows, removing the need for filter dilation. We also introduce a new "super-separable" convolution operation that further reduces the number of parameters and computational cost for obtaining state-of-the-art results.
Motivation and Objectives
- Reduce the parameter count and computation of convolutional NMT architectures.
- Explore applying depthwise separable and grouped convolutions to sequence-to-sequence models.
- Evaluate the impact of removing filter dilation by using larger convolution windows.
- Introduce and assess the new super-separable convolution operation.
- Demonstrate state-of-the-art translation results with SliceNet under constrained resources.
Proposed Method
- Propose SliceNet, a stack of depthwise separable convolution layers with residual connections and optional grouped and super-separable convolutions.
- Replace traditional regular convolutions with depthwise separable convolutions to reduce parameters and computation.
- Encode inputs and outputs with two sub-networks whose results are concatenated and fed to an autoregressive decoder with attention.
- Employ layer normalization and ReLU activations within convolutional modules.
- Explore and compare dilation versus larger convolution windows for receptive field growth.
- Point to the reference implementation in the TensorFlow Tensor2Tensor library.
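To make the building blocks in this list concrete, here is a minimal NumPy sketch of a 1D depthwise separable convolution and of a super-separable variant that splits the channels into groups, applies a separable convolution to each group, and concatenates the results. It is an illustration under stated assumptions, not the Tensor2Tensor code: the function names, the "same" zero padding, and the toy shapes are all hypothetical, and layer normalization, ReLU, and residual connections are left out for brevity.

```python
import numpy as np

def depthwise_conv1d(x, dw):
    """Per-channel 1D convolution with 'same' zero padding.

    x:  (length, channels) input sequence
    dw: (kernel, channels) one filter per channel (no channel mixing)
    """
    k, _ = dw.shape
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    xp = np.pad(x, ((pad_left, pad_right), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        # Tap i of every channel's filter, applied to the shifted input.
        out += xp[i:i + x.shape[0]] * dw[i]
    return out

def separable_conv1d(x, dw, pw):
    """Depthwise convolution followed by a pointwise (1x1) projection.

    pw: (channels_in, channels_out) mixes channels at each position.
    """
    return depthwise_conv1d(x, dw) @ pw

def super_separable_conv1d(x, dws, pws):
    """Split channels into len(dws) groups, run a separable convolution
    on each group independently, then concatenate along the channel axis."""
    groups = np.split(x, len(dws), axis=1)
    outs = [separable_conv1d(xg, dw, pw) for xg, dw, pw in zip(groups, dws, pws)]
    return np.concatenate(outs, axis=1)

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
length, channels, kernel, g = 10, 8, 3, 2
x = rng.normal(size=(length, channels))

dw = rng.normal(size=(kernel, channels))
pw = rng.normal(size=(channels, channels))
y = separable_conv1d(x, dw, pw)

dws = [rng.normal(size=(kernel, channels // g)) for _ in range(g)]
pws = [rng.normal(size=(channels // g, channels // g)) for _ in range(g)]
y_super = super_separable_conv1d(x, dws, pws)

print(y.shape, y_super.shape)                          # (10, 8) (10, 8)
print(dw.size + pw.size)                               # 88 = k*c + c^2
print(sum(d.size + p.size for d, p in zip(dws, pws)))  # 56 = k*c + c^2/g
```

In this sketch the groups of the super-separable variant never exchange information within a single layer, which is the price paid for the smaller pointwise projection.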
Experimental Results
Research Questions
- RQ1: Do depthwise separable convolutions improve translation quality over regular convolutions in a ByteNet-like architecture?
- RQ2: Can removing dilation and relying on larger convolution windows maintain or improve performance in NMT?
- RQ3: What is the impact of intermediate grouped (sub-separable) convolutions compared to full depthwise separable convolutions?
- RQ4: Does the proposed super-separable convolution offer additional performance gains over standard depthwise separable convolutions?
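The dilation question (RQ2) comes down to how the receptive field of a stack of convolutions grows. A rough sketch of that arithmetic for stride-1 layers follows. The helper name and the two example configurations (a kernel-3 stack with doubling dilations versus an undilated stack with widening windows) are illustrative assumptions, not settings confirmed by this summary.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field of stacked stride-1 1D convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    if len(dilations) != len(kernel_sizes):
        raise ValueError("kernel_sizes and dilations must have the same length")
    field = 1
    for k, d in zip(kernel_sizes, dilations):
        field += (k - 1) * d
    return field

# Hypothetical stack A: small windows, growth comes from dilation.
dilated = receptive_field([3, 3, 3, 3], dilations=[1, 2, 4, 8])

# Hypothetical stack B: no dilation, growth comes from wider windows.
wide = receptive_field([3, 7, 15, 31])

print(dilated)  # 31
print(wide)     # 53
```

Wider windows are only affordable when the per-layer cost grows slowly with the kernel size, which is what depthwise separability provides.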
Key Findings
| Convolution type | Parameters per position (approx.) | Negative log perplexity | Accuracy |
|---|---|---|---|
| Non-Emb. | k·c² | -1.92 | 62.41 |
| Full | k·c² | -1.83 | 63.87 |
| Full | k·7‑7‑7 | -1.80 | 64.37 |
| Full | k‑7‑15‑15 | -1.80 | 64.30 |
| Full | k‑7‑15‑31 | -1.80 | 64.36 |
| 16 Groups | k·c²/g+c² | -1.86 | 63.46 |
| Super 2/3 | k·c+c²/g | -1.78 | 64.71 |
| Full (2048) | k·c+c²/g | -1.68 | 66.71 |
| Super 2/3 (3072) | k·c+c²/g | -1.64 | 67.27 |
- Depthwise separable convolutions yield better accuracy with fewer parameters and lower computational cost than regular convolutions in a ByteNet-like NMT model.
- Replacing dilation with larger convolution windows in depthwise separable convolutions can achieve comparable or better results; dilation is not required.
- Using grouped convolutions (16 groups) performs worse than full depthwise separable convolutions, suggesting higher separability is beneficial.
- The super-separable convolution provides incremental performance gains over standard depthwise separable convolutions.
- Larger SliceNet models with depthwise separable or super-separable convolutions achieve state-of-the-art BLEU scores on WMT EN-DE: they score 25.5–26.1 on newstest14, with the larger Super 2/3 model reaching 26.1.
- SliceNet models use less than half the non-embedding parameters and FLOPs of ByteNet while achieving better translation quality.
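The parameter savings behind these findings follow from simple per-layer counts. The sketch below evaluates the usual formulas for a 1D convolution with kernel size k, c input and output channels, and g groups: k·c² for a regular convolution, k·c + c² for a depthwise separable one, k·c²/g + c² for a grouped (sub-separable) one, and k·c + c²/g for a super-separable one. The function names and the example values of k, c, and g are hypothetical and chosen only so that the divisions are exact; they are not the configurations used in the paper.

```python
def regular_params(k, c):
    """Regular convolution: every tap mixes all channels."""
    return k * c * c

def separable_params(k, c):
    """Depthwise filters (k*c) plus a pointwise projection (c*c)."""
    return k * c + c * c

def grouped_params(k, c, g):
    """Grouped convolution in place of the depthwise step, plus pointwise."""
    if c % g:
        raise ValueError("c must be divisible by g")
    return k * c * c // g + c * c

def super_separable_params(k, c, g):
    """Depthwise filters plus g independent pointwise blocks of size (c/g)^2."""
    if c % g:
        raise ValueError("c must be divisible by g")
    return k * c + c * c // g

k, c = 3, 1536  # hypothetical kernel size and channel count

print(regular_params(k, c))             # 7077888
print(separable_params(k, c))           # 2363904
print(grouped_params(k, c, 16))         # 2801664
print(super_separable_params(k, c, 2))  # 1184256
print(super_separable_params(k, c, 3))  # 791040

# Widening the window is cheap once the convolution is separable.
print(separable_params(31, c))          # 2406912
print(regular_params(31, c))            # 73138176
```

Under these assumed sizes, growing the kernel from 3 to 31 adds under 2% to a separable layer but multiplies the cost of a regular layer by about ten, which matches the finding that larger windows can stand in for dilation.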
This explainer was generated by AI and reviewed by human editors.