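[Paper Review] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
This paper introduces the Switch Transformer, a sparsely activated Mixture-of-Experts model that routes each token to a single expert. It scales to trillion-parameter models, improves training stability, speeds up pre-training under a fixed FLOP budget, and can be distilled into dense models while retaining a meaningful share of the quality gains.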
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
Research Motivation and Objectives
- Scale Transformer models by increasing parameter count while keeping per-token compute fixed.
- Simplify and stabilize Mixture-of-Experts routing to enable scalable sparse models on TPU/GPU hardware.
- Demonstrate training stability with mixed precision and new initialization schemes.
- Show practical benefits across pre-training, fine-tuning, and multilingual settings.
Proposed Method
- Propose a Switch routing scheme that routes each token to a single expert (k=1) in a Mixture-of-Experts layer.
- Use a differentiable router with a load-balancing auxiliary loss to distribute tokens across experts.
- Replace the dense FFN with a sparse Switch FFN layer that processes each token independently, using a capacity factor to bound per-expert load and passing overflowed tokens through the residual connection.
- Apply selective precision training (float32 routing computations with bfloat16 elsewhere) to stabilize training.
- Introduce a reduced initialization scale and expert regularization (a higher dropout rate inside expert layers) to enable larger expert counts and stable fine-tuning.
- Provide FLOP-matched comparisons against dense and MoE baselines, and report results on pre-training, fine-tuning, and multilingual tasks.
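The routing, capacity, and load-balancing steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `switch_route`, its arguments, and the default `alpha` are assumptions made for the example. It assumes the expert capacity is (tokens per batch / number of experts) × capacity factor, and that the auxiliary loss is `alpha * N * sum(f_i * P_i)`, where `f_i` is the fraction of tokens whose top choice is expert `i` and `P_i` is the mean router probability for expert `i`.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits, capacity_factor=1.0, alpha=0.01):
    """Top-1 (k=1) routing for one batch of tokens (illustrative sketch).

    router_logits: list of per-token lists, one logit per expert.
    Returns (assignments, gates, aux_loss). An assignment of None marks a
    token that overflowed its expert's capacity.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    # Assumed capacity rule: (tokens / experts) * capacity factor.
    capacity = int(num_tokens / num_experts * capacity_factor)

    probs = [softmax(row) for row in router_logits]
    load = [0] * num_experts          # tokens accepted by each expert
    chosen_count = [0] * num_experts  # tokens whose argmax is each expert
    assignments, gates = [], []
    for p in probs:
        expert = max(range(num_experts), key=lambda i: p[i])
        chosen_count[expert] += 1
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append(expert)
            gates.append(p[expert])   # expert output is scaled by this gate
        else:
            # Overflow: the token skips the expert (residual path only).
            assignments.append(None)
            gates.append(0.0)

    # Load-balancing auxiliary loss: alpha * N * sum_i f_i * P_i.
    f = [c / num_tokens for c in chosen_count]
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    aux_loss = alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))
    return assignments, gates, aux_loss
```

For example, with four tokens, two experts, and a capacity factor of 1.0, each expert accepts at most two tokens, so a third token that prefers the same expert is marked `None`. Raising the capacity factor to 1.5 lets that token through, at the cost of more padded expert slots.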
Experimental Results
Research Questions
- RQ1: Can a simplified single-expert routing (Switch) maintain or improve model quality while reducing routing cost compared to traditional MoE?
- RQ2: How does increasing the number of experts (while keeping FLOPs per token fixed) affect training speed and sample efficiency?
- RQ3: What training techniques (precision, initialization, regularization) are required to stabilize large sparse models?
- RQ4: Do Switch Transformers provide tangible benefits across pre-training, fine-tuning, and multilingual settings?
- RQ5: Can large sparse models be distilled into smaller dense models without large losses in quality?
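The premise behind RQ2, that adding experts grows the parameter count while per-token FLOPs stay roughly fixed, can be checked with simple arithmetic. The sketch below is an assumption-laden illustration: `switch_ffn_cost` is a hypothetical helper, biases are ignored, one multiply-add is counted as 2 FLOPs, and the layer sizes in the usage note are illustrative rather than taken from the paper.

```python
def switch_ffn_cost(d_model, d_ff, num_experts):
    """Parameter count and per-token FLOPs for one Switch FFN layer (sketch)."""
    # Each expert is a two-matrix FFN: d_model x d_ff and d_ff x d_model.
    expert_params = 2 * d_model * d_ff
    # The router is a single d_model x num_experts projection.
    router_params = d_model * num_experts
    total_params = num_experts * expert_params + router_params
    # With k=1 routing a token visits exactly one expert, plus the router.
    flops_per_token = 2 * expert_params + 2 * router_params
    return total_params, flops_per_token

params_1, flops_1 = switch_ffn_cost(768, 3072, 1)
params_64, flops_64 = switch_ffn_cost(768, 3072, 64)
print(params_64 / params_1, flops_64 / flops_1)
```

With these illustrative sizes, going from 1 to 64 experts multiplies the layer's parameters by roughly 64, while per-token FLOPs grow only by the small router term (about 1%).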
Key Findings
- Switch Transformers achieve 7x+ pre-training speedups over tuned T5 baselines at the same compute budget.
- Switch-Base with 64 experts trains faster than T5-Base and attains similar or better quality, demonstrating strong speed-quality benefits.
- Measured in wall-clock time, Switch Transformers outperform dense baselines with equivalent FLOPs per token (e.g., the 64-expert Switch-Base reaches similar quality in about one-seventh of the time T5-Base needs).
- Switch-Large FLOP-matched to T5-Large yields superior scaling and fine-tuning performance over larger dense baselines.
- Universal multilingual gains across 101 languages with 91% of languages benefiting from 4x+ speedups over mT5.
- Large sparse models can be distilled into compact dense models, preserving approximately 30% of the sparse model improvements while using ~1/20th of the parameters.
- Selective precision training (local router computations in float32) stabilizes training while preserving near-bfloat16 speed.
- Initialization and regularization strategies enable stable training of trillion-parameter-scale models.
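The distillation finding above relies on training a small dense student against a large sparse teacher. A common way to do this, sketched here, is to mix a cross-entropy term against the teacher's output probabilities with a cross-entropy term against the ground-truth label. The function name `distillation_loss`, the parameter `teacher_weight`, and its default of 0.25 are assumptions for illustration, not details confirmed by this review.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over the student's output logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def distillation_loss(student_logits, teacher_probs, label, teacher_weight=0.25):
    """Weighted mix of soft (teacher) and hard (label) cross-entropy.

    teacher_weight is a hypothetical knob: 0.0 ignores the teacher,
    1.0 ignores the ground-truth label.
    """
    logp = log_softmax(student_logits)
    hard = -logp[label]                                        # CE vs. label
    soft = -sum(t * lp for t, lp in zip(teacher_probs, logp))  # CE vs. teacher
    return teacher_weight * soft + (1.0 - teacher_weight) * hard
```

In a real setup `teacher_probs` would come from the sparse model's softmax output for the same input, and the loss would be averaged over all tokens in the batch.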
This review was generated by AI and checked by a human editor.