QUICK REVIEW

[Paper Review] The Evolved Transformer

David R. So, Chen Liang, Quoc V. Le | arXiv (Cornell University) | Jan 30, 2019
Cited by 196
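ONE-SENTENCE SUMMARY

This paper applies evolutionary neural architecture search (NAS) with Progressive Dynamic Hurdles, seeded with the Transformer, to find a faster and more accurate feed-forward sequence-to-sequence model that outperforms the Transformer on several language tasks. The resulting model sets a new state-of-the-art BLEU on WMT’14 En-De and is more parameter-efficient at smaller sizes.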

ABSTRACT

Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.

MOTIVATION AND OBJECTIVES

  • Motivate neural architecture search (NAS) for improving feed-forward seq2seq models beyond the Transformer.
  • Construct a large search space that can represent the Transformer and includes components from recent feed-forward seq2seq models.
  • Develop Progressive Dynamic Hurdles (PDH) to efficiently search directly on compute-intensive tasks.
  • Seed the search with the Transformer to improve search efficiency and performance.
  • Demonstrate that the evolved architecture, the Evolved Transformer (ET), outperforms the Transformer across multiple tasks and sizes.

PROPOSED METHOD

  • Use tournament-selection evolutionary NAS with a gene-encoding that represents encoder/decoder blocks.
  • Seed the initial population with the Transformer to anchor the search.
  • Construct a two-cell search space (encoder and decoder) with NASNet-style blocks and multiple branch-level fields.
  • Introduce Progressive Dynamic Hurdles (PDH) to allocate more training steps to promising candidates while discarding poor ones early.
  • Train candidate models on WMT’14 En-De to evaluate fitness via validation perplexity, then mutate and select to evolve architectures.
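
The loop described in the list above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_fitness` is a hypothetical stand-in for training a candidate on WMT’14 En-De and measuring validation fitness, genomes are plain lists of floats rather than encoder/decoder block encodings, and the population size, step increments, and hurdle schedule are made-up values chosen only so the script runs quickly.

```python
import random


def toy_fitness(genome, steps):
    """Hypothetical stand-in for: train `steps` steps, return validation fitness."""
    quality = sum(genome) / len(genome)
    # In this toy model, longer training yields a higher (closer to final) score.
    return quality * (1.0 - 1.0 / (1.0 + steps))


def mutate(genome, rng):
    """Resample one randomly chosen gene (an assumed mutation rule)."""
    child = list(genome)
    child[rng.randrange(len(child))] = rng.random()
    return child


def evaluate_with_hurdles(genome, hurdles, step_increments):
    """Train in stages; grant the next stage only if fitness clears the hurdle."""
    steps = step_increments[0]
    fitness = toy_fitness(genome, steps)
    for hurdle, extra_steps in zip(hurdles, step_increments[1:]):
        if fitness <= hurdle:
            break  # discard early: no more compute for this candidate
        steps += extra_steps
        fitness = toy_fitness(genome, steps)
    return fitness, steps


def evolve(seed_genome, rounds=3, children_per_round=20, pop_size=10,
           tournament_size=3, step_increments=(10, 20, 40), rng_seed=0):
    rng = random.Random(rng_seed)
    hurdles = []
    # Warm start: every initial individual is a copy of the seed genome.
    fitness, steps = evaluate_with_hurdles(seed_genome, hurdles, step_increments)
    population = [{"genome": list(seed_genome), "fitness": fitness, "steps": steps}
                  for _ in range(pop_size)]
    for _round in range(rounds):
        for _child in range(children_per_round):
            # Tournament: the fittest member of a random sample becomes the parent.
            sample = rng.sample(population, tournament_size)
            parent = max(sample, key=lambda ind: ind["fitness"])
            genome = mutate(parent["genome"], rng)
            fitness, steps = evaluate_with_hurdles(genome, hurdles, step_increments)
            population.append({"genome": genome, "fitness": fitness, "steps": steps})
            # Keep the population size fixed: drop the weakest of another sample.
            idxs = rng.sample(range(len(population)), tournament_size)
            population.pop(min(idxs, key=lambda i: population[i]["fitness"]))
        if len(hurdles) < len(step_increments) - 1:
            # New hurdle: mean fitness of the members trained the longest so far.
            max_steps = max(ind["steps"] for ind in population)
            longest = [ind["fitness"] for ind in population
                       if ind["steps"] == max_steps]
            hurdles.append(sum(longest) / len(longest))
    return max(population, key=lambda ind: ind["fitness"]), hurdles
```

Calling `evolve([0.5] * 8)` returns the fittest surviving individual together with the hurdles that were set along the way. The point of the staged evaluation is that only candidates that clear a hurdle receive the additional training steps, so most of the compute goes to the promising ones.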

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: Can neural architecture search find a feed-forward seq2seq architecture superior to the Transformer for translation and language modeling?
  • RQ2: Does seeding the search with the Transformer and using PDH improve NAS efficiency and final model quality?
  • RQ3: What architectural characteristics emerge in the evolved model compared with the Transformer?
  • RQ4: How does the Evolved Transformer (ET) compare to the Transformer across multiple tasks and model sizes?

KEY FINDINGS

| Task | Size | Transformer params | ET params | Transformer perplexity | ET perplexity | Transformer BLEU | ET BLEU |
|---|---|---|---|---|---|---|---|
| WMT’14 En-De | Base | 61.1M | 64.1M | 4.24 ± 0.03 | 4.03 ± 0.02 | 28.2 ± 0.2 | 28.4 ± 0.2 |
| WMT’14 En-De | Big | 210.4M | 221.7M | 3.87 ± 0.02 | 3.77 ± 0.02 | 29.1 ± 0.1 | 29.3 ± 0.1 |
| WMT’14 En-De | Deep | 224.0M | 218.1M | 3.86 ± 0.02 | 3.69 ± 0.01 | 29.2 ± 0.1 | 29.5 ± 0.1 |
| WMT’14 En-Fr | Base | 60.8M | 63.8M | 3.61 ± 0.01 | 3.42 ± 0.01 | 40.0 ± 0.1 | 40.6 ± 0.1 |
| WMT’14 En-Fr | Big | 209.8M | 221.2M | 3.26 ± 0.01 | 3.13 ± 0.01 | 41.2 ± 0.1 | 41.3 ± 0.1 |
| WMT’14 En-Cs | Base | 59.8M | 62.7M | 4.98 ± 0.04 | 4.42 ± 0.01 | 27.0 ± 0.1 | 27.6 ± 0.2 |
| WMT’14 En-Cs | Big | 207.6M | 218.9M | 4.43 ± 0.01 | 4.38 ± 0.03 | 28.1 ± 0.1 | 28.2 ± 0.1 |
| LM1B | Big | 141.1M | 151.8M | 30.44 ± 0.04 | 28.60 ± 0.03 | - | - |
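
The rows above are easier to compare as relative changes. The snippet below is a small sketch that copies the mean values from the table (dropping the ± spreads) and computes, per row, the change in parameter count, the perplexity reduction, and the BLEU gain. The names `ROWS` and `summarize` are illustrative and not from the paper.

```python
# (task, size, Transformer params [M], ET params [M],
#  Transformer perplexity, ET perplexity, Transformer BLEU, ET BLEU)
ROWS = [
    ("En-De", "Base", 61.1, 64.1, 4.24, 4.03, 28.2, 28.4),
    ("En-De", "Big", 210.4, 221.7, 3.87, 3.77, 29.1, 29.3),
    ("En-De", "Deep", 224.0, 218.1, 3.86, 3.69, 29.2, 29.5),
    ("En-Fr", "Base", 60.8, 63.8, 3.61, 3.42, 40.0, 40.6),
    ("En-Fr", "Big", 209.8, 221.2, 3.26, 3.13, 41.2, 41.3),
    ("En-Cs", "Base", 59.8, 62.7, 4.98, 4.42, 27.0, 27.6),
    ("En-Cs", "Big", 207.6, 218.9, 4.43, 4.38, 28.1, 28.2),
    ("LM1B", "Big", 141.1, 151.8, 30.44, 28.60, None, None),  # no BLEU for LM1B
]


def summarize(row):
    """Return relative changes of ET versus the Transformer for one table row."""
    task, size, t_params, et_params, t_ppl, et_ppl, t_bleu, et_bleu = row
    return {
        "task": task,
        "size": size,
        # Positive means ET has more parameters than the Transformer.
        "param_change_pct": round(100 * (et_params - t_params) / t_params, 1),
        # Positive means ET has lower (better) perplexity.
        "ppl_reduction_pct": round(100 * (t_ppl - et_ppl) / t_ppl, 1),
        "bleu_gain": None if t_bleu is None else round(et_bleu - t_bleu, 1),
    }


if __name__ == "__main__":
    for row in ROWS:
        print(summarize(row))
```

Because the spreads are ignored, small differences such as a 0.1 BLEU gain should be read with the reported ± values in mind.
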
  • ET consistently outperforms the Transformer across translation and language modeling tasks.
  • At the big model size, ET sets a new state-of-the-art BLEU of 29.8 on WMT’14 En-De.
  • At smaller sizes, ET matches the quality of the original "big" Transformer with 37.6% fewer parameters, and at a mobile-friendly size of ~7M parameters it outperforms the Transformer by 0.7 BLEU.
  • In the table, ET lowers perplexity in every row; BLEU gains are largest at the base size on En-Fr and En-Cs (+0.6 each, versus +0.1 at the big size).
  • ET’s notable architectural traits include wide depth-wise separable convolutions in lower layers, branching structures, gated activations, and swish activations.
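
Two of the traits named in the last bullet are simple enough to write out. The sketch below shows the standard swish activation, x · sigmoid(x), and a gated activation in the form of a Gated Linear Unit. Treating "gated activations" as GLU is an assumption made for illustration, and the snippet says nothing about where these operations sit inside ET.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """Swish activation with beta = 1: x * sigmoid(x)."""
    return x * sigmoid(x)


def glu(values):
    """Gated Linear Unit: split the input in half and gate the first half
    element-wise with the sigmoid of the second half."""
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of inputs")
    half = len(values) // 2
    return [a * sigmoid(b) for a, b in zip(values[:half], values[half:])]
```

For example, `glu([2.0, 4.0, 0.0, 0.0])` halves each of the first two values, because the gate `sigmoid(0.0)` is 0.5.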

This review was generated by AI and checked by a human editor.