QUICK REVIEW

[Paper Review] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph | arXiv (Cornell University) | Jan 11, 2021

Topic Modeling | References: 51 | Cited by: 361

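One-Sentence Summary

The paper proposes the Switch Transformer, a sparsely activated mixture-of-experts model that greatly increases parameter count while keeping the FLOPs per token constant. It achieves up to a 7x pre-training speedup, scales to trillion-parameter models with better scaling, fine-tuning, and multilingual results, and can be distilled into compact dense models.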

ABSTRACT

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.

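Motivation and Goals

Use sparsely activated (MoE) models to scale parameter count independently of compute, improving training efficiency and performance.
Simplify Mixture-of-Experts routing and communication so that stable, scalable training on TPUs becomes possible.
Show that increasing the number of experts (parameters) at fixed FLOPs per token yields faster pre-training and better sample efficiency.
Demonstrate downstream benefits through fine-tuning, multilingual evaluation, and distillation into compact dense models.
Provide practical training techniques that stabilize large sparse models (precision handling, initialization, regularization).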

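Proposed Method

Adopts Switch routing, which sends each token to a single expert (k=1), to reduce routing computation and communication.
Uses a differentiable gate p_i(x) together with an auxiliary load-balancing loss that encourages a more uniform distribution of tokens across experts.
Implements sparse routing with a static expert capacity and a capacity factor to manage token dispatch and overflow.
Pre-trains with a masked language modeling objective on the Colossal Clean Crawled Corpus (C4), dropping out (masking) 15% of tokens, and compares models by perplexity.
Applies selective precision to stabilize low-precision training (router computation runs in float32 while the rest of the model stays in bfloat16).
Fine-tunes on a diverse set of NLP tasks, distills large sparse models into smaller dense models, and evaluates multilingual performance across 101 languages.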
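The routing steps listed above can be illustrated with a minimal, pure-Python sketch. It assumes the formulation described in the paper: capacity is (tokens per batch / number of experts) × capacity factor, and the auxiliary loss is N · Σ f_i · P_i, where f_i is the fraction of tokens whose top-1 choice is expert i and P_i is the mean router probability for expert i. The function names (`softmax`, `switch_route`) and the overflow handling (marking the token as `None`) are illustrative choices, not the authors' implementation, and the loss is returned without the scaling coefficient.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits, num_experts, capacity_factor=1.0):
    """Top-1 (k=1) routing with a static expert capacity.

    router_logits: one list of length num_experts per token.
    Returns (assignments, gates, aux_loss), where assignments[t] is the
    expert index for token t, or None if that expert was already full.
    """
    num_tokens = len(router_logits)
    # Static capacity: each expert processes at most this many tokens.
    capacity = int(num_tokens / num_experts * capacity_factor)

    probs = [softmax(row) for row in router_logits]
    load = [0] * num_experts           # tokens actually dispatched
    top1_counts = [0] * num_experts    # tokens whose argmax is this expert
    assignments, gates = [], []

    for p in probs:
        expert = max(range(num_experts), key=lambda i: p[i])
        top1_counts[expert] += 1
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append(expert)
            gates.append(p[expert])    # gate value scales the expert output
        else:
            # Overflow: in this sketch the token skips the expert layer.
            assignments.append(None)
            gates.append(0.0)

    # Auxiliary load-balancing term (unscaled): N * sum_i f_i * P_i.
    f = [c / num_tokens for c in top1_counts]
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    aux_loss = num_experts * sum(fi * Pi for fi, Pi in zip(f, P))
    return assignments, gates, aux_loss

# Three of four tokens prefer expert 0; with capacity 2, one of them overflows.
logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
assignments, gates, aux_loss = switch_route(logits, num_experts=2)
print(assignments, aux_loss)
```

Under perfectly uniform routing the unscaled auxiliary term equals 1, and it grows as routing becomes more skewed, which is why minimizing it pushes the router toward balanced expert usage. Raising `capacity_factor` reduces overflow at the cost of more padding and communication.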

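Experimental Results

Research Questions

RQ1: Can a sparsely activated Transformer with single-expert routing (Switch) match or exceed the quality of dense and MoE models under the same compute budget?
RQ2: How do routing simplification, initialization strategy, and precision techniques affect the stability and scalability of large Switch Transformer models?
RQ3: Does the Switch Transformer deliver consistent gains over dense baselines in pre-training speed, fine-tuning performance, and multilingual settings?
RQ4: Can large sparse models be distilled into compact dense models while retaining most of the performance gains?
RQ5: How does the Switch Transformer scale as the number of experts increases at fixed FLOPs per token?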

Key Findings

Model	Capacity factor	Quality after 100k steps (neg. log perplexity)	Time to quality threshold (hours)	Speed (examples/sec)
T5-Base	—	-1.731	Not achieved †	1600
T5-Large	—	-1.550	131.1	470
MoE-Base	2.0	-1.547	68.7	840
Switch-Base	2.0	-1.554	72.8	860
MoE-Base	1.25	-1.559	80.7	790
Switch-Base	1.25	-1.553	65.0	910
MoE-Base	1.0	-1.572	80.1	860
Switch-Base	1.0	-1.561	62.8	1000
Switch-Base+	1.0	-1.534	67.6	780
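To make the time-to-quality column easier to compare, the short sketch below computes each sparse model's wall-clock speedup relative to T5-Large, using only the hours listed in the table. T5-Base is omitted because it did not reach the threshold; the dictionary layout and variable names are illustrative.

```python
# Hours to reach the quality threshold, copied from the table above.
# Keys are (model, capacity factor).
hours_to_quality = {
    ("MoE-Base", 2.0): 68.7,
    ("Switch-Base", 2.0): 72.8,
    ("MoE-Base", 1.25): 80.7,
    ("Switch-Base", 1.25): 65.0,
    ("MoE-Base", 1.0): 80.1,
    ("Switch-Base", 1.0): 62.8,
    ("Switch-Base+", 1.0): 67.6,
}
t5_large_hours = 131.1  # dense baseline that reached the threshold

# Speedup = baseline time / model time (higher is better).
speedups = {key: t5_large_hours / hours for key, hours in hours_to_quality.items()}

fastest = max(speedups, key=speedups.get)
for (model, cf), s in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{model:13s} capacity factor {cf:<4} {s:.2f}x vs T5-Large")
print("fastest:", fastest)
```

By this measure every sparse configuration reaches the threshold well ahead of T5-Large, and Switch-Base is faster than MoE-Base at capacity factors 1.0 and 1.25 but not at 2.0.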

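Under the same compute budget, the Switch Transformer pre-trains faster than dense and MoE baselines, reaching up to a 7x speedup in some settings.
With FLOPs per token held constant, increasing the number of experts consistently improves perplexity and sample efficiency on a per-step basis, yielding larger and more capable models.
Switch-Base with 64 experts reaches quality similar to T5-Base in roughly one-seventh of the time on the same compute and hardware, demonstrating its wall-clock efficiency.
The Switch Transformer remains competitive even against larger dense baselines (for example, Switch-Base, which is FLOP-matched to T5-Base, outperforms T5-Large, and Switch-Large surpasses T5-Large on several metrics).
Selective precision (casting to float32 only inside the router) provides training stability at nearly the speed of full bfloat16 training, enabling stable training at scale.
Fine-tuning and distillation experiments show significant downstream gains for Switch variants on GLUE, SuperGLUE, QA, summarization, and multilingual tasks; distillation into a student with 1/20 of the parameters retains about 30% of the gains.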
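The distillation result above can be pictured with a generic sketch of a distillation objective for a dense student. This is a common formulation that mixes a soft-target term (teacher probabilities) with a hard-label term; the function names and the default `teacher_weight` value are illustrative assumptions and should not be read as the exact recipe used in the paper.

```python
import math

def cross_entropy(target_probs, student_probs):
    """Cross-entropy H(target, student) over one prediction."""
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

def distillation_loss(student_probs, teacher_probs, label, teacher_weight=0.25):
    """Weighted mix of soft (teacher) and hard (ground-truth) targets.

    teacher_weight is a hypothetical mixing coefficient for this sketch.
    """
    hard = [1.0 if i == label else 0.0 for i in range(len(student_probs))]
    soft_term = cross_entropy(teacher_probs, student_probs)
    hard_term = cross_entropy(hard, student_probs)
    return teacher_weight * soft_term + (1.0 - teacher_weight) * hard_term

# Toy three-class example: the sparse teacher is slightly less peaked
# than the dense student on the correct class (index 0).
student = [0.7, 0.2, 0.1]
teacher = [0.6, 0.3, 0.1]
print(distillation_loss(student, teacher, label=0))
```

Setting `teacher_weight` to 0 recovers ordinary supervised training, so the coefficient controls how much of the sparse teacher's behavior the dense student is asked to imitate.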

This review was generated by AI and reviewed by a human editor.