[Paper Explained] The Evolved Transformer
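This paper applies evolutionary neural architecture search (NAS) combined with Progressive Dynamic Hurdles, seeded with the Transformer, to find a feed-forward sequence-to-sequence model that outperforms the Transformer on several language tasks. It reaches a new state-of-the-art BLEU score on WMT'14 En-De and is more parameter-efficient at smaller model sizes.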
Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.
Research Motivation and Goals
- Motivate neural architecture search (NAS) for improving feed-forward seq2seq models beyond the Transformer.
- Construct a large, Transformer-representative search space that includes modern seq2seq components.
- Develop Progressive Dynamic Hurdles (PDH) to efficiently search directly on compute-intensive tasks.
- Seed the search with the Transformer to improve search efficiency and performance.
- Demonstrate that the evolved architecture, the Evolved Transformer (ET), outperforms the Transformer across multiple tasks and sizes.
Proposed Method
- Use tournament-selection evolutionary NAS with a gene-encoding that represents encoder/decoder blocks.
- Seed the initial population with the Transformer to anchor the search.
- Construct a two-cell search space (encoder and decoder) with NASNet-style blocks and multiple branch-level fields.
- Introduce Progressive Dynamic Hurdles (PDH) to allocate more training steps to promising candidates while discarding poor ones early.
- Train candidate models on WMT’14 En-De to evaluate fitness via validation perplexity, then mutate and select to evolve architectures.
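The search loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the gene encoding (a list of floats), `toy_fitness`, the step schedule, and every parameter default are hypothetical stand-ins, and the real system trains each candidate on WMT'14 En-De instead of calling a closed-form function. It shows the two ideas together: tournament selection with a warm-started population, and hurdles that grant extra training steps only to candidates that beat the mean fitness of earlier models.

```python
import random

def toy_fitness(genes, steps):
    """Hypothetical stand-in for 'train for `steps` steps, return fitness'.
    Higher is better; fitness grows with both gene quality and training steps."""
    quality = sum(genes) / len(genes)
    return quality * (1.0 - 1.0 / (1.0 + steps / 100.0))

def evaluate_with_hurdles(genes, hurdles, step_increments):
    """Train in stages; grant the next stage only if the current hurdle is cleared."""
    steps, fitness = 0, 0.0
    for stage, increment in enumerate(step_increments):
        steps += increment
        fitness = toy_fitness(genes, steps)
        # Stop if no hurdle exists yet for this stage, or the candidate fails it.
        if stage >= len(hurdles) or fitness <= hurdles[stage]:
            break
    return fitness, steps

def evolve(seed_genes, population_size=8, num_children=30, hurdle_every=10,
           step_increments=(100, 200, 400), tournament_size=3, seed=0):
    rng = random.Random(seed)
    hurdles = []
    # Warm start: the initial population consists of copies of the seed model.
    population = []
    for _ in range(population_size):
        fit, steps = evaluate_with_hurdles(seed_genes, hurdles, step_increments)
        population.append({"genes": list(seed_genes), "fitness": fit, "steps": steps})

    for child_idx in range(1, num_children + 1):
        # Tournament selection: sample a sub-population, breed the best,
        # and replace the worst sampled individual with the child.
        contenders = rng.sample(range(len(population)), tournament_size)
        parent_i = max(contenders, key=lambda i: population[i]["fitness"])
        loser_i = min(contenders, key=lambda i: population[i]["fitness"])
        child_genes = list(population[parent_i]["genes"])
        child_genes[rng.randrange(len(child_genes))] = rng.random()  # toy mutation
        fit, steps = evaluate_with_hurdles(child_genes, hurdles, step_increments)
        population[loser_i] = {"genes": child_genes, "fitness": fit, "steps": steps}

        # Every `hurdle_every` children, add a hurdle: the mean fitness of the
        # individuals that were trained for the current maximum number of steps.
        if child_idx % hurdle_every == 0 and len(hurdles) < len(step_increments) - 1:
            target_steps = sum(step_increments[:len(hurdles) + 1])
            reached = [p["fitness"] for p in population if p["steps"] == target_steps]
            if reached:
                hurdles.append(sum(reached) / len(reached))
    return population, hurdles

if __name__ == "__main__":
    final_population, final_hurdles = evolve([0.5] * 6)
    best = max(final_population, key=lambda p: p["fitness"])
    print("hurdles:", final_hurdles)
    print("best fitness:", best["fitness"], "after", best["steps"], "steps")
```

The design point is that weak candidates cost only the first, cheapest training stage, so most of the compute budget goes to candidates that keep clearing hurdles.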
Experimental Results
Research Questions
- RQ1: Can neural architecture search find a feed-forward seq2seq architecture superior to the Transformer for translation and language modeling?
- RQ2: Does seeding the search with the Transformer and using PDH improve NAS efficiency and final model quality?
- RQ3: What architectural characteristics emerge in the evolved model compared with the Transformer?
- RQ4: How does the Evolved Transformer (ET) compare to the Transformer across multiple tasks and model sizes?
Key Findings
- ET consistently outperforms the Transformer across translation and language modeling tasks.
- On WMT’14 En-De, ET achieves a state-of-the-art BLEU of 29.8 with a comparable parameter count to the Transformer.
- At smaller sizes, ET matches the quality of the original "big" Transformer with 37.6% fewer parameters; at a mobile-friendly size of ~7M parameters, it outperforms the Transformer by 0.7 BLEU.
- ET shows improvements at base and big sizes across En-De, En-Fr, En-Cs, and LM1B, with large gains in smaller models.
- ET’s notable architectural traits include wide depth-wise separable convolutions in lower layers, branching structures, gated activations, and swish activations.
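Three of the traits listed above have simple closed forms, sketched below in plain Python. The sketch assumes that "gated activations" refers to a gated linear unit (GLU), which multiplies one half of its input by the sigmoid of the other half. The function names, the kernel width of 9, and the channel count of 512 are illustrative choices, not values taken from this summary. The weight-count helpers ignore biases and show why a depth-wise separable convolution can be made wide cheaply.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)

def glu(values):
    """Gated linear unit: split the input in half, gate the first half
    with the sigmoid of the second half."""
    half = len(values) // 2
    signal, gate = values[:half], values[half:]
    return [s * sigmoid(g) for s, g in zip(signal, gate)]

def conv1d_weights(kernel, c_in, c_out):
    """Weight count of a standard 1D convolution (biases ignored)."""
    return kernel * c_in * c_out

def separable_conv1d_weights(kernel, c_in, c_out):
    """Weight count of a depth-wise separable 1D convolution:
    one kernel per input channel, then a 1x1 point-wise projection."""
    return kernel * c_in + c_in * c_out

if __name__ == "__main__":
    print(swish(1.0))
    print(glu([2.0, -1.0, 0.0, 3.0]))
    # Illustrative sizes only: kernel width 9, 512 channels in and out.
    print(conv1d_weights(9, 512, 512), separable_conv1d_weights(9, 512, 512))
```

With these illustrative sizes the separable variant needs roughly a ninth of the weights of the standard convolution, which is the kind of saving that makes wide kernels affordable.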
This explainer was generated by AI and reviewed by a human editor.