QUICK REVIEW

[Paper Review] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph | arXiv (Cornell University) | Jan 11, 2021

Topic Modeling | Cited by 700

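One-Sentence Summary

This paper proposes the Switch Transformer, a sparsely activated Mixture-of-Experts model that routes each token to a single expert. It scales to trillion-parameter sizes while improving training stability, pre-trains faster at a fixed FLOPs budget, and can be distilled into compact dense models that retain a meaningful share of the quality gains.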

ABSTRACT

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.

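Motivation and Goals

Scale Transformer models by increasing the parameter count while keeping the compute per token fixed.
Simplify and stabilize Mixture-of-Experts routing so that sparse models scale on TPU/GPU hardware.
Demonstrate stable training under mixed precision and a new initialization scheme.
Show practical gains in pre-training, fine-tuning, and multilingual settings.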
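The first goal above (more parameters, same per-token compute) can be illustrated with a small back-of-the-envelope calculation. This is only a sketch: the helper names are hypothetical, the layer sizes (768 and 3072) and the expert count of 64 are illustrative choices rather than figures taken from this review, and biases are ignored.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one two-layer feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


def switch_ffn_stats(d_model: int, d_ff: int, num_experts: int) -> dict:
    """Total weights vs. weights touched per token under single-expert routing."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one router logit per expert
    return {
        "total_params": num_experts * per_expert + router,
        # Each token visits the router and exactly one expert.
        "active_params_per_token": per_expert + router,
    }


# Illustrative sizes (assumptions, not values from the review).
dense = ffn_params(768, 3072)
sparse = switch_ffn_stats(768, 3072, num_experts=64)

print("dense FFN weights:      ", dense)
print("sparse total weights:   ", sparse["total_params"])
print("sparse active per token:", sparse["active_params_per_token"])
```

Under these assumptions the sparse layer stores roughly 64 times as many weights as the dense one, while the weights each token actually multiplies against grow only by the small router matrix.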

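Proposed Method

Propose a Switch routing scheme that sends each token in a Mixture-of-Experts layer to a single expert (k=1).
Use a differentiable router with an auxiliary load-balancing loss to spread tokens across experts.
Replace the dense FFN with a sparse Switch FFN that processes tokens independently, with a per-expert capacity factor and handling for tokens that overflow an expert's capacity.
Apply selective-precision training (float32 inside the router computation, bfloat16 elsewhere) to stabilize training.
Introduce initialization scaling and expert regularization to support larger expert counts and stable fine-tuning.
Compare against dense models and MoE baselines under matched FLOPs, and report results on pre-training, fine-tuning, and multilingual tasks.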
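The routing steps listed above (single-expert selection, a capacity limit with overflow, and a load-balancing term) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, capacity is taken as (tokens / experts) × capacity factor, overflowed tokens are simply marked `None` (in a real model they would skip the expert and rely on the residual connection), and the balance term is computed as N · Σ fᵢ · Pᵢ without the small scaling coefficient that would normally multiply it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(router_logits, capacity_factor=1.0):
    """Top-1 routing for one batch of tokens.

    router_logits: list of per-token lists, one logit per expert.
    Returns (assignments, gates, aux_loss). An assignment of None marks a
    token that overflowed its expert's capacity.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    # Assumed capacity rule: even share of the batch, scaled by the factor.
    capacity = int(num_tokens / num_experts * capacity_factor)

    probs = [softmax(row) for row in router_logits]
    load = [0] * num_experts    # tokens accepted by each expert
    counts = [0] * num_experts  # tokens that chose each expert (before capacity)
    assignments, gates = [], []
    for p in probs:
        expert = max(range(num_experts), key=p.__getitem__)  # k = 1
        counts[expert] += 1
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append(expert)
            gates.append(p[expert])  # expert output is scaled by this gate
        else:
            assignments.append(None)  # overflow: expert is skipped
            gates.append(0.0)

    # f_i: fraction of tokens whose top choice is expert i.
    # P_i: mean router probability assigned to expert i.
    f = [c / num_tokens for c in counts]
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    aux_loss = num_experts * sum(fi * pi for fi, pi in zip(f, P))
    return assignments, gates, aux_loss


# Four tokens, two experts; three tokens prefer expert 0.
logits = [[2.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
assignments, gates, aux = switch_route(logits, capacity_factor=1.0)
print(assignments)  # the third token overflows expert 0
print(round(aux, 4))
```

With a capacity factor of 1.0 each expert accepts two of the four tokens, so the third token that prefers expert 0 overflows. Raising the factor to 2.0 lets all tokens through at the cost of more reserved compute and memory per expert. The balance term equals 1.0 when routing is perfectly uniform and grows as tokens concentrate on a few experts.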

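Experimental Results

Research Questions

RQ1: Can simplified single-expert routing (Switch) maintain or improve model quality relative to conventional MoE while lowering routing cost?
RQ2: With FLOPs per token held fixed, how does increasing the number of experts affect training speed and sample efficiency?
RQ3: Which training techniques (precision, initialization, regularization) are needed to stabilize large sparse models?
RQ4: Do Switch Transformers deliver tangible gains in pre-training, fine-tuning, and multilingual settings?
RQ5: Can large sparse models be distilled into smaller dense models without a large loss in quality?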

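Key Findings

With the same computational budget, Switch Transformers pre-train up to 7x faster than tuned T5 baselines.
The 64-expert Switch-Base trains faster than T5-Base and reaches similar or higher quality, a clear gain on the speed-quality trade-off.
In wall-clock terms, Switch Transformers outperform FLOP-matched dense baselines and save substantial time (for example, the 64-expert Switch-Base reaches similar quality in roughly one-seventh of the time T5-Base needs).
Switch-Large, FLOP-matched to T5-Large, scales and fine-tunes better than the dense baseline and compares favorably with larger dense models.
Multilingual gains hold across all 101 languages, and 91% of languages see a speedup of 4x or more over mT5.
Large sparse models can be distilled into compact dense models that keep about 30% of the sparse model's improvement with roughly 1/20 of the parameters.
Selective-precision training (float32 only within the local router computation) stabilizes training while keeping speed close to that of pure bfloat16.
Initialization and regularization strategies help stabilize training at scales up to a trillion parameters.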
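The distillation finding above (about 30% of the improvement kept at roughly 1/20 of the parameters) can be made concrete with a short calculation. This is a sketch under assumptions: the helper names are hypothetical, the scores are made-up "higher is better" values chosen only to show the arithmetic, and the exact metric used in the paper is not specified in this review.

```python
def gain_preserved(baseline: float, teacher: float, student: float) -> float:
    """Share of the teacher's improvement over the baseline kept by the student.

    All three scores must use the same 'higher is better' metric.
    """
    return (student - baseline) / (teacher - baseline)


def compression(teacher_params: float, student_params: float) -> float:
    """Fraction of the teacher's parameters removed by distillation."""
    return 1.0 - student_params / teacher_params


# Hypothetical scores (not results from the paper).
baseline_score = -1.70   # small dense model trained from scratch
teacher_score = -1.40    # large sparse teacher
student_score = -1.61    # small dense model distilled from the teacher

print(round(gain_preserved(baseline_score, teacher_score, student_score), 2))
print(round(compression(teacher_params=20.0, student_params=1.0), 2))
```

Read this way, "keeps 30% of the improvement" means the distilled student closes 30% of the gap between an undistilled dense model of the same size and the sparse teacher, and a student with 1/20 of the parameters corresponds to 95% compression.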


This review was generated by AI and checked by human editors.