QUICK REVIEW

[Paper Review] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Samyam Rajbhandari, Conglong Li|arXiv (Cornell University)|Jan 14, 2022

Domain Adaptation and Few-Shot Learning | Cited by 55

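One-Sentence Summary

This paper presents DeepSpeed-MoE, which combines PR-MoE, Mixture-of-Students, and an optimized MoE inference system to cut training cost by up to 5x for autoregressive MoE models while delivering significantly faster and cheaper inference.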

ABSTRACT

As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Its training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.

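Motivation and Goals

Extend the applicability of MoE to autoregressive NLG tasks, reducing training cost while preserving quality.
Improve the parameter efficiency of MoE through novel architectures, shrinking model size without sacrificing performance.
Build an end-to-end, highly optimized MoE inference system for scalable deployment.
Introduce Mixture-of-Students distillation to compress MoE models further and enable faster inference.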

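Proposed Method

Introduces Pyramid-Residual MoE (PR-MoE), which allocates more experts to the later layers and uses residual connections to improve efficiency.
Explores two phenomena: (I) deeper MoE layers benefit more from additional experts; (II) residual/Top-2 configurations can match or even exceed the quality of a standard MoE at lower communication volume.
Combines Pyramid-MoE and Residual-MoE into PR-MoE for parameter efficiency.
Implements flexible multi-expert and multi-data parallelism in DeepSpeed-MoE so that PR-MoE, whose expert count differs from layer to layer, can be trained without load imbalance.
Develops Mixture-of-Students (MoS) through staged knowledge distillation, in which a shallower student model mirrors the teacher PR-MoE and remains sparse.
Proposes a KD formulation for training MoS and PR-MoS that retains the sparsity and inference benefits of MoE.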
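
The staged distillation objective can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's implementation: it assumes that "staged" means the teacher term is applied only during an initial portion of training, and that the objective is a cross-entropy loss plus a weighted KL term from teacher to student. The names `staged_kd_loss`, `kd_stop_step`, and `alpha` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(student_logits, labels):
    # Mean negative log-likelihood of the true labels under the student.
    logp = log_softmax(student_logits)
    return -logp[np.arange(len(labels)), labels].mean()

def kl_teacher_student(teacher_logits, student_logits):
    # Mean KL(teacher || student) over the batch.
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return (np.exp(t) * (t - s)).sum(axis=-1).mean()

def staged_kd_loss(student_logits, teacher_logits, labels,
                   step, kd_stop_step, alpha=1.0):
    # Assumption: the distillation term is active only before kd_stop_step;
    # afterwards the student trains on the label loss alone.
    loss = cross_entropy(student_logits, labels)
    if step < kd_stop_step:
        loss = loss + alpha * kl_teacher_student(teacher_logits, student_logits)
    return loss

# Toy batch: 2 examples, 3 classes (values chosen only for illustration).
student = np.array([[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]])
teacher = np.array([[0.0, 3.0, 0.0], [1.0, -1.0, 0.0]])
labels = np.array([0, 2])
early = staged_kd_loss(student, teacher, labels, step=0, kd_stop_step=5)
late = staged_kd_loss(student, teacher, labels, step=10, kd_stop_step=5)
print(early, late)
```

Because the student here is itself a sparse MoE, a loss of this shape changes only what the student is trained to match; the routing structure, and with it the inference benefit of sparsity, is left untouched.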

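Experimental Results

Research Questions

RQ1: Can MoE be applied effectively to autoregressive NLG to reduce training cost without sacrificing quality?
RQ2: Compared with standard MoE, does PR-MoE significantly reduce parameter count while maintaining or improving model quality?
RQ3: Can knowledge distillation produce smaller MoE models (MoS/PR-MoS) that keep the advantages of MoE and offer faster inference?
RQ4: How can an end-to-end MoE inference system be designed to deliver low latency and low cost at scale (hundreds to thousands of GPUs)?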

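Key Findings

MoE models achieve better validation loss than their dense counterparts and match or exceed the quality of larger dense models at lower training cost (for example, 1.3B+MoE-128 is close in quality to a 6.7B dense model).
Training throughput measurements show that MoE models reach the same quality as larger dense baselines at 5x lower training cost.
PR-MoE reduces parameter count by up to 3x with accuracy comparable to standard MoE.
MoS distillation, combined with PR-MoE, reduces MoE model size by up to 3.7x while retaining similar zero-shot performance.
DeepSpeed-MoE inference reduces latency and cost by up to 7.3x compared with existing MoE inference solutions, and serves trillion-parameter MoE models with latency under 25 ms.
Combining PR-MoE and MoS yields strong parameter efficiency with minimal quality loss relative to larger MoE baselines.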
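
To make the parameter-efficiency findings more concrete, here is a minimal NumPy sketch of the two ideas behind PR-MoE: a pyramid schedule that gives later MoE layers more experts, and a residual layer in which every token passes through a shared MLP plus one gated expert. It is an illustration under stated assumptions, not DeepSpeed code: the function names, the "double the expert count in the last few layers" schedule, the toy sizes, and the gate-weighted sum are all hypothetical choices.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp(x, w1, w2):
    # Two-layer feed-forward block with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

def residual_moe_layer(x, dense, experts, w_gate):
    # x: (tokens, d_model); dense: (w1, w2); experts: list of (w1, w2);
    # w_gate: (d_model, num_experts).
    probs = softmax(x @ w_gate)      # routing probabilities per token
    top1 = probs.argmax(axis=-1)     # each token picks a single expert
    out = mlp(x, *dense)             # shared MLP applied to every token
    for e, (w1, w2) in enumerate(experts):
        mask = top1 == e
        if mask.any():
            # Add the selected expert's output, scaled by its gate probability.
            out[mask] += probs[mask, e:e + 1] * mlp(x[mask], w1, w2)
    return out

def pyramid_expert_counts(num_moe_layers, base_experts, num_widened):
    # Assumed schedule: the last num_widened layers get twice as many experts.
    return ([base_experts] * (num_moe_layers - num_widened)
            + [2 * base_experts] * num_widened)

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
d_model, d_ff, tokens = 8, 16, 5

def init_ffn():
    return rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

counts = pyramid_expert_counts(num_moe_layers=4, base_experts=2, num_widened=1)
x = rng.normal(size=(tokens, d_model))
for n in counts:
    experts = [init_ffn() for _ in range(n)]
    x = residual_moe_layer(x, init_ffn(), experts, rng.normal(size=(d_model, n)))
print(counts, sum(counts), x.shape)
```

Since most parameters of an MoE model sit in the expert weights, the total number of expert instances (`sum(counts)` above) is a rough proxy for model size. A pyramid schedule lowers that total relative to using the larger expert count in every layer, while the shared MLP adds only one extra feed-forward block per MoE layer.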

This review was generated by AI and checked by human editors.