QUICK REVIEW

[Paper Review] The Evolved Transformer

David R. So, Liang Chen|arXiv (Cornell University)|Jan 30, 2019
Cited by 196
One-sentence summary

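This paper applies evolutionary neural architecture search (NAS) with Progressive Dynamic Hurdles, seeded with the Transformer, to find a faster and more accurate feed-forward sequence-to-sequence model that outperforms the Transformer on several language tasks. It achieves a new state-of-the-art BLEU on WMT’14 En-De and is more parameter-efficient at smaller sizes.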

ABSTRACT

Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.

Motivation and Objectives

  • Motivate neural architecture search (NAS) for improving feed-forward seq2seq models beyond the Transformer.
  • Construct a large, Transformer-representative search space that includes modern seq2seq components.
  • Develop Progressive Dynamic Hurdles (PDH) to efficiently search directly on compute-intensive tasks.
  • Seed the search with the Transformer to improve search efficiency and performance.
  • Demonstrate that the evolved architecture, the Evolved Transformer (ET), outperforms the Transformer across multiple tasks and sizes.

Proposed Method

  • Use tournament-selection evolutionary NAS with a gene-encoding that represents encoder/decoder blocks.
  • Seed the initial population with the Transformer to anchor the search.
  • Construct a two-cell search space (encoder and decoder) with NASNet-style blocks and multiple branch-level fields.
  • Introduce Progressive Dynamic Hurdles (PDH) to allocate more training steps to promising candidates while discarding poor ones early.
  • Train candidate models on WMT’14 En-De to evaluate fitness via validation perplexity, then mutate and select to evolve architectures.
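The search loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the genome is a plain list of floats, `train_and_eval` is a hypothetical stand-in for actually training a model and measuring validation fitness, and all names, step counts, and rates are invented for the example. It shows the three ideas together: warm starting from a seed genome, tournament selection, and hurdles that stop weak candidates early.

```python
import random


def train_and_eval(genome, steps):
    """Hypothetical stand-in for training a candidate and scoring it.

    Fitness grows with training steps and with the genome's mean value.
    """
    quality = sum(genome) / len(genome)
    return quality * (1.0 - 1.0 / (1.0 + steps / 1000.0))


def evaluate_with_hurdles(genome, hurdles, step_increments):
    """Train in stages; stop as soon as a hurdle is not cleared."""
    total_steps, fitness = 0, 0.0
    for stage, increment in enumerate(step_increments):
        total_steps += increment
        fitness = train_and_eval(genome, total_steps)
        if stage < len(hurdles) and fitness <= hurdles[stage]:
            break  # poor candidate: no further compute is spent on it
    return fitness, total_steps


def mutate(genome, rng, rate=0.2):
    """Resample each gene with probability `rate` (assumed mutation rule)."""
    return [rng.random() if rng.random() < rate else g for g in genome]


def make_individual(genome, hurdles, step_increments):
    fitness, steps = evaluate_with_hurdles(genome, hurdles, step_increments)
    return {"genome": genome, "fitness": fitness, "steps": steps}


def run_search(seed_genome, rounds=60, pop_size=10, sample_size=4,
               hurdle_every=20, stage_steps=1000, seed=0):
    rng = random.Random(seed)
    hurdles, step_increments = [], [stage_steps]
    # Warm start: every initial individual is a mutation of the seed genome.
    population = [make_individual(mutate(seed_genome, rng), hurdles, step_increments)
                  for _ in range(pop_size)]
    for round_idx in range(1, rounds + 1):
        # Tournament: the fittest of a random sample becomes the parent.
        sample = rng.sample(population, sample_size)
        parent = max(sample, key=lambda ind: ind["fitness"])
        child = make_individual(mutate(parent["genome"], rng), hurdles, step_increments)
        # The least fit sampled individual is replaced by the child.
        worst = min(sample, key=lambda ind: ind["fitness"])
        population = [ind for ind in population if ind is not worst] + [child]
        if round_idx % hurdle_every == 0:
            # New hurdle: mean fitness of individuals that cleared all earlier hurdles.
            max_steps = sum(step_increments)
            passed = [ind["fitness"] for ind in population if ind["steps"] == max_steps]
            if passed:
                hurdles.append(sum(passed) / len(passed))
                step_increments.append(stage_steps)
    return population, hurdles


population, hurdles = run_search(seed_genome=[0.5] * 8)
best = max(population, key=lambda ind: ind["fitness"])
print(len(hurdles), best["steps"], round(best["fitness"], 3))
```

Because a candidate only earns further training stages by beating the current hurdle, most of the step budget in this sketch goes to candidates that already look promising, which is the resource-allocation idea the summary attributes to PDH.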

Experimental Results

Research Questions

  • RQ1: Can neural architecture search find a feed-forward seq2seq architecture superior to the Transformer for translation and language modeling?
  • RQ2: Does seeding the search with the Transformer and using PDH improve NAS efficiency and final model quality?
  • RQ3: What architectural characteristics emerge in the evolved model compared with the Transformer?
  • RQ4: How does the Evolved Transformer (ET) compare to the Transformer across multiple tasks and model sizes?

Key Findings

  • ET consistently outperforms the Transformer across translation and language modeling tasks.
  • On WMT’14 En-De, the big-size ET achieves a state-of-the-art BLEU of 29.8 with a parameter count comparable to the Transformer's.
  • At smaller sizes, ET matches the quality of the original "big" Transformer with 37.6% fewer parameters, and at a mobile-friendly size of ~7M parameters it outperforms the Transformer by 0.7 BLEU.
  • ET shows improvements at base and big sizes across En-De, En-Fr, En-Cs, and LM1B, with large gains in smaller models.
  • ET’s notable architectural traits include wide depth-wise separable convolutions in lower layers, branching structures, gated activations, and swish activations.
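For readers unfamiliar with the components named in the last finding, here is a minimal sketch of the standard definitions of swish, a gated linear unit, and the weight count of a depth-wise separable convolution versus an ordinary one. The function names are made up for illustration, and the kernel width of 9 and channel size of 512 are assumed example values, not figures taken from this summary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


def glu(a, b):
    """Gated linear unit: content `a` scaled by the gate sigmoid(b)."""
    return a * sigmoid(b)


def conv1d_params(kernel_width, c_in, c_out):
    """Weights in a standard 1D convolution (biases ignored)."""
    return kernel_width * c_in * c_out


def separable_conv1d_params(kernel_width, c_in, c_out):
    """One depth-wise filter per input channel, then a 1x1 point-wise mix."""
    return kernel_width * c_in + c_in * c_out


print(swish(0.0), glu(2.0, 0.0))             # 0.0 1.0
print(conv1d_params(9, 512, 512))            # 2359296
print(separable_conv1d_params(9, 512, 512))  # 266752
```

The parameter counts suggest why separable convolutions can be made wide cheaply: in this example the separable layer needs roughly a ninth of the weights of the standard one.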

This review was generated by AI and reviewed by human editors.