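[Paper Summary] The Evolved Transformer
The paper applies evolutionary neural architecture search (NAS) with Progressive Dynamic Hurdles, seeded with the Transformer, to find a feed-forward sequence-to-sequence model that outperforms the Transformer on several language tasks. The resulting model sets a new state-of-the-art BLEU on WMT’14 En-De and is more parameter-efficient at smaller sizes.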
Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.
Research Motivation and Goals
- Motivate neural architecture search (NAS) for improving feed-forward seq2seq models beyond the Transformer.
- Construct a large, Transformer-representative search space that includes modern seq2seq components.
- Develop Progressive Dynamic Hurdles (PDH) to efficiently search directly on compute-intensive tasks.
- Seed the search with the Transformer to improve search efficiency and performance.
- Demonstrate that the evolved architecture, the Evolved Transformer (ET), outperforms the Transformer across multiple tasks and sizes.
Proposed Method
- Use tournament-selection evolutionary NAS with a gene-encoding that represents encoder/decoder blocks.
- Seed the initial population with the Transformer to anchor the search.
- Construct a two-cell search space (encoder and decoder) with NASNet-style blocks and multiple branch-level fields.
- Introduce Progressive Dynamic Hurdles (PDH) to allocate more training steps to promising candidates while discarding poor ones early.
- Train candidate models on WMT’14 En-De to evaluate fitness via validation perplexity, then mutate and select to evolve architectures.
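The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the genome is a hypothetical list of 0/1 genes rather than the real block encoding, `toy_fitness` stands in for actually training a model (it returns a score where higher is better, whereas the paper uses validation perplexity, where lower is better), and the population size, tournament size, and step increments are made-up values. Each hurdle here is the mean fitness of the individuals that trained for the most steps so far, which is one plausible reading of "allocate more training steps to promising candidates while discarding poor ones early".

```python
import random
from statistics import mean


def toy_fitness(genome, steps):
    """Hypothetical stand-in for training a model for `steps` steps.

    Returns a score where higher is better; it grows with the share of
    1-genes and with the number of training steps.
    """
    quality = sum(genome) / len(genome)
    return quality * steps / (steps + 100)


def evaluate_with_hurdles(genome, hurdles, step_increments, fitness_fn=toy_fitness):
    """Train in stages, granting the next stage only if a hurdle is cleared."""
    total_steps, fitness = 0, None
    for stage, extra_steps in enumerate(step_increments):
        total_steps += extra_steps
        fitness = fitness_fn(genome, total_steps)
        # Stop if no hurdle exists for this stage yet, or the model fails it.
        if stage >= len(hurdles) or fitness <= hurdles[stage]:
            break
    return fitness, total_steps


def mutate(genome, rng):
    """Flip one randomly chosen gene (a toy mutation operator)."""
    child = list(genome)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def evolve(seed, population_size=20, tournament_size=5, rounds=3,
           children_per_round=40, step_increments=(100, 200, 400), rng_seed=0):
    rng = random.Random(rng_seed)
    hurdles = []
    # Warm start: every initial individual is a copy of the seed genome.
    population = []
    for _ in range(population_size):
        fitness, steps = evaluate_with_hurdles(seed, hurdles, step_increments)
        population.append((fitness, steps, list(seed)))

    for _ in range(rounds):
        for _ in range(children_per_round):
            # Tournament: sample a subpopulation, mutate its best member,
            # and replace its worst member with the resulting child.
            sample = rng.sample(range(len(population)), tournament_size)
            best = max(sample, key=lambda i: population[i][0])
            worst = min(sample, key=lambda i: population[i][0])
            child = mutate(population[best][2], rng)
            fitness, steps = evaluate_with_hurdles(child, hurdles, step_increments)
            population[worst] = (fitness, steps, child)
        # New hurdle: mean fitness of the members that reached the latest stage.
        if len(hurdles) < len(step_increments) - 1:
            stage_steps = sum(step_increments[:len(hurdles) + 1])
            reached = [f for f, s, _ in population if s == stage_steps]
            if reached:
                hurdles.append(mean(reached))

    return max(population, key=lambda item: item[0]), hurdles


if __name__ == "__main__":
    seed_genome = [0, 1, 0, 1, 0, 1, 0, 1]  # hypothetical "Transformer" seed
    (fitness, steps, genome), hurdles = evolve(seed_genome)
    print(f"best fitness {fitness:.3f} after {steps} steps, hurdles {hurdles}")
```

In this sketch, a weak child costs only the first step increment, while a child that keeps clearing hurdles is trained for the full budget, so most of the compute goes to the candidates that look promising early.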
Experimental Results
Research Questions
- RQ1: Can neural architecture search find a feed-forward seq2seq architecture superior to the Transformer for translation and language modeling?
- RQ2: Does seeding the search with the Transformer and using PDH improve NAS efficiency and final model quality?
- RQ3: What architectural characteristics emerge in the evolved model compared with the Transformer?
- RQ4: How does the Evolved Transformer (ET) compare to the Transformer across multiple tasks and model sizes?
Key Findings
| Task | Size | Transformer Params | ET Params | Transformer Perplexity | ET Perplexity | Transformer BLEU | ET BLEU |
|---|---|---|---|---|---|---|---|
| WMT’14 En-De | Base | 61.1M | 64.1M | 4.24 ± 0.03 | 4.03 ± 0.02 | 28.2 ± 0.2 | 28.4 ± 0.2 |
| WMT’14 En-De | Big | 210.4M | 221.7M | 3.87 ± 0.02 | 3.77 ± 0.02 | 29.1 ± 0.1 | 29.3 ± 0.1 |
| WMT’14 En-De | Deep | 224.0M | 218.1M | 3.86 ± 0.02 | 3.69 ± 0.01 | 29.2 ± 0.1 | 29.5 ± 0.1 |
| WMT’14 En-Fr | Base | 60.8M | 63.8M | 3.61 ± 0.01 | 3.42 ± 0.01 | 40.0 ± 0.1 | 40.6 ± 0.1 |
| WMT’14 En-Fr | Big | 209.8M | 221.2M | 3.26 ± 0.01 | 3.13 ± 0.01 | 41.2 ± 0.1 | 41.3 ± 0.1 |
| WMT’14 En-Cs | Base | 59.8M | 62.7M | 4.98 ± 0.04 | 4.42 ± 0.01 | 27.0 ± 0.1 | 27.6 ± 0.2 |
| WMT’14 En-Cs | Big | 207.6M | 218.9M | 4.43 ± 0.01 | 4.38 ± 0.03 | 28.1 ± 0.1 | 28.2 ± 0.1 |
| LM1B | Big | 141.1M | 151.8M | 30.44 ± 0.04 | 28.60 ± 0.03 | - | - |
- ET consistently outperforms the Transformer across translation and language modeling tasks.
- At a big model size, ET sets a new state-of-the-art BLEU of 29.8 on WMT’14 En-De.
- At smaller sizes, ET matches the quality of the original "big" Transformer with 37.6% fewer parameters.
- At a mobile-friendly size of 7M parameters, ET outperforms the Transformer by 0.7 BLEU.
- In the table above, ET lowers perplexity in every row; its BLEU gains are larger at base size than at big size on En-Fr and En-Cs (+0.6 vs. +0.1) and equal (+0.2) at both sizes on En-De.
- ET’s notable architectural traits include wide depth-wise separable convolutions in lower layers, branching structures, gated activations, and swish activations.
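To make the size of these gains easier to read off, the table values can be turned into per-row deltas. The snippet below is a small sketch that copies the mean values from the table (the ± intervals are ignored) and computes the relative perplexity reduction and the absolute BLEU gain; the names `RESULTS` and `summarize` are introduced here for illustration only.

```python
# (task, size): (Transformer perplexity, ET perplexity, Transformer BLEU, ET BLEU)
RESULTS = {
    ("En-De", "Base"): (4.24, 4.03, 28.2, 28.4),
    ("En-De", "Big"): (3.87, 3.77, 29.1, 29.3),
    ("En-De", "Deep"): (3.86, 3.69, 29.2, 29.5),
    ("En-Fr", "Base"): (3.61, 3.42, 40.0, 40.6),
    ("En-Fr", "Big"): (3.26, 3.13, 41.2, 41.3),
    ("En-Cs", "Base"): (4.98, 4.42, 27.0, 27.6),
    ("En-Cs", "Big"): (4.43, 4.38, 28.1, 28.2),
    ("LM1B", "Big"): (30.44, 28.60, None, None),  # no BLEU for language modeling
}


def summarize(results):
    """Return {(task, size): (perplexity drop in %, BLEU gain or None)}."""
    rows = {}
    for key, (t_ppl, et_ppl, t_bleu, et_bleu) in results.items():
        ppl_drop_pct = round(100 * (t_ppl - et_ppl) / t_ppl, 1)
        bleu_gain = None if t_bleu is None else round(et_bleu - t_bleu, 1)
        rows[key] = (ppl_drop_pct, bleu_gain)
    return rows


if __name__ == "__main__":
    for (task, size), (ppl_drop, bleu_gain) in summarize(RESULTS).items():
        gain = "n/a" if bleu_gain is None else f"{bleu_gain:+.1f}"
        print(f"{task:6} {size:5} perplexity -{ppl_drop:.1f}%  BLEU {gain}")
```

Because the intervals are dropped, a small delta such as +0.1 BLEU should be read against the ± 0.1 reported for both models in those rows.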
This summary was generated by AI and reviewed by human editors.