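[Paper Explained] Taming Sparsely Activated Transformer with Stochastic Experts
THOR combines stochastic expert activation with consistency regularization and outperforms the standard Transformer and Switch MoE models on low-resource, rich-resource, and multilingual machine translation, showing higher parameter efficiency.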
Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. However, SAMs are reported to be parameter inefficient such that larger models do not always lead to better performance. While most on-going research focuses on improving SAMs models by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect, i.e., the commonly-used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are trained using a consistency regularized loss, where experts learn not only from training data but also from other experts as teachers, such that all the experts make consistent predictions. We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. For example, in multilingual translation, THOR outperforms the Switch Transformer by 2 BLEU scores, and obtains the same BLEU score as that of a state-of-the-art MoE model that is 18 times larger. Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts.
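To make the abstract's contrast between gate-based and random routing concrete, here is a minimal NumPy sketch of the two routing rules. It is an illustration under stated assumptions, not the authors' code: the function names, the per-token granularity, and the gate matrix `w_gate` are hypothetical, and real Switch-style routers also involve details such as expert capacity limits and auxiliary balancing losses that are omitted here.

```python
import numpy as np

def gated_top1_route(tokens, w_gate):
    """Gate-based routing: each token goes to the expert with the highest gate score."""
    # tokens: (num_tokens, d_model); w_gate: (d_model, num_experts), a learned matrix in practice.
    return np.argmax(tokens @ w_gate, axis=-1)

def random_route(num_tokens, num_experts, rng):
    """Random routing: each token goes to an expert drawn uniformly at random."""
    return rng.integers(0, num_experts, size=num_tokens)

def expert_load(assignments, num_experts):
    """Fraction of tokens assigned to each expert (useful for spotting load imbalance)."""
    return np.bincount(assignments, minlength=num_experts) / len(assignments)
```

Calling `expert_load` on the output of each router gives a quick view of how evenly tokens are spread across experts under the two schemes.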
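Motivation and Goals
- Explain why sparsely activated models (SAMs) can be parameter inefficient even when they have a very large number of parameters.
- Examine whether gate-based routing (top-k experts) outperforms random routing in MoE-style architectures.
- Propose THOR, a SAM that activates experts randomly and uses consistency regularization to align predictions across experts.
- Evaluate THOR on low-resource, rich-resource, and multilingual machine translation to assess parameter efficiency and generalization.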
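Proposed Method
- Analyze gate-based MoE models and identify load imbalance and routing behavior that works no better than random routing.
- Introduce THOR, which activates experts randomly in every layer: a pair of experts per layer at each training iteration, and random activation again at inference.
- Adopt a consistency-regularized objective that minimizes the cross-entropy losses from two random expert selections together with a KL-based consistency term.
- Train THOR in a setup resembling two teachers, in which experts learn from one another to produce consistent predictions.
- Evaluate THOR on standard benchmarks in low-resource, rich-resource, and multilingual MT settings.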
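The training objective outlined in the list above can be sketched for a single expert layer as follows. This is a simplified illustration, not the released implementation: the helper names, the residual connection, the output projection `w_out`, the weight `alpha`, and the choice of a symmetric KL divergence as the "KL-based consistency term" are all assumptions made for the example.

```python
import numpy as np

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expert_ffn(x, w1, w2):
    """One feed-forward expert: ReLU(x @ w1) @ w2."""
    return np.maximum(x @ w1, 0.0) @ w2

def consistency_regularized_loss(x, y, experts, w_out, alpha, rng):
    """Cross-entropy from two random expert choices plus a KL-based consistency term."""
    # Pick two distinct experts uniformly at random; no gating network is involved.
    i, j = rng.choice(len(experts), size=2, replace=False)
    p1 = softmax((x + expert_ffn(x, *experts[i])) @ w_out)
    p2 = softmax((x + expert_ffn(x, *experts[j])) @ w_out)

    # Sum of the two cross-entropy losses, averaged over examples.
    rows = np.arange(len(y))
    ce = -(np.log(p1[rows, y] + EPS) + np.log(p2[rows, y] + EPS)).mean()

    # Symmetric KL, 0.5 * (KL(p1 || p2) + KL(p2 || p1)); the symmetric form is an assumption.
    log_ratio = np.log(p1 + EPS) - np.log(p2 + EPS)
    consistency = 0.5 * ((p1 - p2) * log_ratio).sum(axis=-1).mean()
    return ce + alpha * consistency, ce, consistency
```

Here `experts` is a list of `(w1, w2)` weight pairs with at least two entries, `x` has shape `(batch, d_model)`, and `y` holds integer class labels. When the two sampled experts produce identical predictions the consistency term is zero, which is the behavior the regularizer pushes the experts toward.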
Experimental Results
Research Questions
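- RQ1: Do sparsely activated models inherently perform worse than densely activated models of the same size?
- RQ2: Is gate-based routing necessary for the gains of MoE-style models, or is random expert activation equally effective?
- RQ3: Can consistency regularization enable robust training and inference when experts are activated randomly?
- RQ4: What performance advantage does THOR have over the Transformer and the Switch Transformer across MT tasks and settings?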
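Key Findings
- THOR consistently outperforms the vanilla Transformer and the Switch Transformer in all three settings.
- In low-resource MT, THOR improves average BLEU over Switch by more than 1.0 point and also outperforms the SMART and R3F baselines.
- In rich-resource MT, THOR sets a new state of the art on En-De and En-Fr without data augmentation or pre-training.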
- In multilingual MT, THOR with only 300M parameters matches the BLEU of a 5.5B-parameter state-of-the-art MoE model, which is about 18 times larger (5.5B / 0.3B ≈ 18).
- Compared with the Switch Transformer, THOR shows higher prediction consistency and lower variance, and it overfits less as model size grows.
This explainer was generated by AI and reviewed by human editors.