QUICK REVIEW

[Paper Review] The Evolved Transformer

David R. So, Liang Chen | arXiv (Cornell University) | Jan 30, 2019

Cited by 196

One-Sentence Summary

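This paper applies evolutionary neural architecture search (NAS), combined with Progressive Dynamic Hurdles and seeded with the Transformer, to find a faster and more accurate feed-forward sequence-to-sequence model that outperforms the Transformer on multiple language tasks. It sets a new state-of-the-art BLEU on WMT’14 En-De and is more parameter-efficient at smaller model sizes.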

ABSTRACT

Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.

Research Motivation and Goals

Motivate neural architecture search (NAS) for improving feed-forward seq2seq models beyond the Transformer.
Construct a large, Transformer-representative search space that includes modern seq2seq components.
Develop Progressive Dynamic Hurdles (PDH) to efficiently search directly on compute-intensive tasks.
Seed the search with the Transformer to improve search efficiency and performance.
Demonstrate that the evolved architecture, the Evolved Transformer (ET), outperforms the Transformer across multiple tasks and sizes.

Proposed Method

Use tournament-selection evolutionary NAS with a gene-encoding that represents encoder/decoder blocks.
Seed the initial population with the Transformer to anchor the search.
Construct a two-cell search space (encoder and decoder) with NASNet-style blocks and multiple branch-level fields.
Introduce Progressive Dynamic Hurdles (PDH) to allocate more training steps to promising candidates while discarding poor ones early.
Train candidate models on WMT’14 En-De to evaluate fitness via validation perplexity, then mutate and select to evolve architectures.
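
The search loop described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_fitness` is a hypothetical stand-in for "train a candidate for a number of steps and score it on validation perplexity", genes are plain lists of floats rather than encoder/decoder block encodings, each hurdle is taken to be the mean fitness of the models that reached the current maximum step count, and the removal of weak population members is omitted for brevity.

```python
import random
from statistics import mean


def toy_fitness(genes, steps):
    """Hypothetical stand-in for training `steps` steps and scoring the model.

    Higher is better. Real fitness would come from validation perplexity.
    """
    quality = -sum((g - 0.5) ** 2 for g in genes)
    return quality - 1.0 / steps  # longer training shrinks the penalty


def mutate(genes, rng, rate=0.3):
    """Resample each gene with probability `rate` (assumed mutation scheme)."""
    return [rng.random() if rng.random() < rate else g for g in genes]


def train_with_hurdles(genes, hurdles, step_increments):
    """Train in stages and stop as soon as fitness fails to clear a hurdle."""
    steps = step_increments[0]
    fitness = toy_fitness(genes, steps)
    for hurdle, extra_steps in zip(hurdles, step_increments[1:]):
        if fitness <= hurdle:
            break  # unpromising candidate: no further resources
        steps += extra_steps
        fitness = toy_fitness(genes, steps)
    return fitness, steps


def evolve(seed_genes, rounds=3, children_per_round=20, tournament_size=5,
           step_increments=(10, 20, 40), rng_seed=0):
    rng = random.Random(rng_seed)
    hurdles = []
    # Warm start: the population begins with the seed architecture.
    fitness, steps = train_with_hurdles(seed_genes, hurdles, step_increments)
    population = [(list(seed_genes), fitness, steps)]
    for _ in range(rounds):
        for _ in range(children_per_round):
            # Tournament selection: the fittest of a random sample is the parent.
            sample_size = min(tournament_size, len(population))
            contestants = rng.sample(population, sample_size)
            parent_genes = max(contestants, key=lambda member: member[1])[0]
            child = mutate(parent_genes, rng)
            fitness, steps = train_with_hurdles(child, hurdles, step_increments)
            population.append((child, fitness, steps))
        # Add a hurdle from the models that trained for the current maximum.
        if len(hurdles) < len(step_increments) - 1:
            max_steps = sum(step_increments[:len(hurdles) + 1])
            finished = [f for _, f, s in population if s == max_steps]
            if finished:
                hurdles.append(mean(finished))
    return population, hurdles


if __name__ == "__main__":
    population, hurdles = evolve([0.0, 0.0, 0.0, 0.0])
    best = max(population, key=lambda member: member[1])
    print("hurdles:", hurdles)
    print("best fitness:", best[1], "after", best[2], "steps")
```

The point of the staged training is that most candidates only ever consume the first, cheapest step budget; only those that clear a hurdle are granted more steps.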

Experimental Results

Research Questions

RQ1: Can neural architecture search find a feed-forward seq2seq architecture superior to the Transformer for translation and language modeling?
RQ2: Does seeding the search with the Transformer and using PDH improve NAS efficiency and final model quality?
RQ3: What architectural characteristics emerge in the evolved model compared with the Transformer?
RQ4: How does the Evolved Transformer (ET) compare to the Transformer across multiple tasks and model sizes?

Key Findings

Task	Size	Transformer Params	ET Params	Transformer Perplexity	ET Perplexity	Transformer BLEU	ET BLEU
WMT’14 En-De	Base	61.1M	64.1M	4.24 ± 0.03	4.03 ± 0.02	28.2 ± 0.2	28.4 ± 0.2
WMT’14 En-De	Big	210.4M	221.7M	3.87 ± 0.02	3.77 ± 0.02	29.1 ± 0.1	29.3 ± 0.1
WMT’14 En-De	Deep	224.0M	218.1M	3.86 ± 0.02	3.69 ± 0.01	29.2 ± 0.1	29.5 ± 0.1
WMT’14 En-Fr	Base	60.8M	63.8M	3.61 ± 0.01	3.42 ± 0.01	40.0 ± 0.1	40.6 ± 0.1
WMT’14 En-Fr	Big	209.8M	221.2M	3.26 ± 0.01	3.13 ± 0.01	41.2 ± 0.1	41.3 ± 0.1
WMT’14 En-Cs	Base	59.8M	62.7M	4.98 ± 0.04	4.42 ± 0.01	27.0 ± 0.1	27.6 ± 0.2
WMT’14 En-Cs	Big	207.6M	218.9M	4.43 ± 0.01	4.38 ± 0.03	28.1 ± 0.1	28.2 ± 0.1
LM1B	Big	141.1M	151.8M	30.44 ± 0.04	28.60 ± 0.03	-	-
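
To make the perplexity columns easier to compare across tasks, the relative reduction can be computed directly from the table. The short sketch below uses only the central values copied from the rows above and ignores the ± intervals; the `relative_reduction` helper is a name introduced here for illustration.

```python
# (task, size, Transformer perplexity, ET perplexity), copied from the table.
ROWS = [
    ("WMT'14 En-De", "Base", 4.24, 4.03),
    ("WMT'14 En-De", "Big", 3.87, 3.77),
    ("WMT'14 En-De", "Deep", 3.86, 3.69),
    ("WMT'14 En-Fr", "Base", 3.61, 3.42),
    ("WMT'14 En-Fr", "Big", 3.26, 3.13),
    ("WMT'14 En-Cs", "Base", 4.98, 4.42),
    ("WMT'14 En-Cs", "Big", 4.43, 4.38),
    ("LM1B", "Big", 30.44, 28.60),
]


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    for task, size, transformer_ppl, et_ppl in ROWS:
        reduction = relative_reduction(transformer_ppl, et_ppl)
        print(f"{task:13s} {size:5s} {reduction:5.2f}% lower perplexity")
```

On these central values the reduction for En-Cs is about 11% at base size and about 1% at big size, which illustrates how the gap tends to narrow as models grow.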

ET consistently outperforms the Transformer across translation and language modeling tasks.
On WMT’14 En-De, ET achieves a state-of-the-art BLEU of 29.8 with a comparable parameter count to the Transformer.
At smaller sizes, ET matches the quality of the original "big" Transformer with 37.6% fewer parameters, and at a mobile-friendly size of ~7M parameters it outperforms the Transformer by 0.7 BLEU.
ET improves perplexity for every task and size listed in the table (En-De, En-Fr, En-Cs, and LM1B), and the gains are generally larger for smaller models.
ET’s notable architectural traits include wide depth-wise separable convolutions in lower layers, branching structures, gated activations, and swish activations.
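
For readers unfamiliar with the components named above, here is a minimal pure-Python sketch of the standard definitions. It assumes the gated activation is a Gated Linear Unit style gate and that swish is x · sigmoid(βx). The kernel width of 9 and channel width of 512 in the usage lines are illustrative values chosen here, and the parameter counts ignore biases.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def glu(values):
    """Gated Linear Unit on a flat vector.

    The first half carries the signal; the sigmoid of the second half gates it.
    """
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of inputs")
    half = len(values) // 2
    return [a * sigmoid(b) for a, b in zip(values[:half], values[half:])]


def conv1d_params(kernel, channels_in, channels_out):
    """Weights in a standard 1-D convolution (biases ignored)."""
    return kernel * channels_in * channels_out


def separable_conv1d_params(kernel, channels_in, channels_out):
    """Weights in a depth-wise separable 1-D convolution (biases ignored).

    Depth-wise step: one kernel per input channel.
    Point-wise step: a 1x1 convolution that mixes channels.
    """
    return kernel * channels_in + channels_in * channels_out


if __name__ == "__main__":
    print(swish(1.0))
    print(glu([2.0, 0.0]))
    print(conv1d_params(9, 512, 512), separable_conv1d_params(9, 512, 512))
```

The parameter count shows why a separable convolution can afford to be wide: the kernel width multiplies only the per-channel term, not the channel-mixing term.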

This review was generated by AI and reviewed by human editors.