QUICK REVIEW

[Paper Review] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Samyam Rajbhandari, Conglong Li|arXiv (Cornell University)|2022. 01. 14.

Domain Adaptation and Few-Shot Learning | Citations: 55

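One-Line Summary

This paper presents DeepSpeed-MoE, which combines PR-MoE and Mixture-of-Students with an optimized MoE inference system. For autoregressive MoE models, it cuts training cost by up to 5x and delivers inference that is up to 4.5x faster and 9x cheaper than quality-equivalent dense models.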

ABSTRACT

As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Its training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for autoregressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.

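Research Motivation and Goals

Extend MoE to autoregressive NLG tasks, reducing training cost while preserving quality.
Improve MoE parameter efficiency with new architectures that shrink model size without degrading quality.
Develop a highly optimized end-to-end MoE inference system for scalable deployment.
Introduce Mixture-of-Students distillation to compress MoE models further for faster inference.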

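Proposed Method

Introduce Pyramid-Residual MoE (PR-MoE), which assigns more experts to later layers and uses residual connections for efficiency.
Examine two phenomena: (I) deeper MoE layers tend to benefit more from additional experts; (II) residual/Top-2 configurations can match or exceed the quality of a standard MoE, with the residual design requiring less communication.
Combine Pyramid-MoE and Residual-MoE into PR-MoE for parameter efficiency.
Implement multi-expert and multi-data parallelism in DeepSpeed-MoE so that PR-MoE, whose layers have different numbers of experts, can be trained without load imbalance.
Develop Mixture-of-Students (MoS) through staged knowledge distillation, in which a shallower student mimics the PR-MoE teacher while remaining sparse.
Propose a knowledge distillation (KD) formulation that trains MoS and PR-MoS while preserving the sparsity and inference benefits of MoE.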
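
The PR-MoE idea described above can be illustrated with a minimal, pure-Python sketch. It is not the DeepSpeed-MoE implementation: the names `pyramid_expert_counts` and `ResidualMoELayer`, the layer sizes, the two-level expert schedule, and the simplified top-1 gate (no gate-probability scaling, no capacity limit, no load-balancing loss) are all assumptions made for illustration. It shows the two ingredients only: later layers receive more experts, and each token passes through a fixed dense MLP plus one routed expert whose output is added as a correction.

```python
import random


def pyramid_expert_counts(num_moe_layers, base_experts, late_experts, num_late_layers):
    """Hypothetical pyramid schedule: the last `num_late_layers` MoE layers get more experts."""
    first_late = num_moe_layers - num_late_layers
    return [late_experts if i >= first_late else base_experts
            for i in range(num_moe_layers)]


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def make_mlp(rng, d_model, d_hidden):
    """Create a two-layer MLP as a pair of small random weight matrices."""
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out


def mlp_forward(mlp, x):
    w_in, w_out = mlp
    return matvec(w_out, relu(matvec(w_in, x)))


class ResidualMoELayer:
    """Illustrative Residual-MoE layer: fixed dense MLP + one top-1 routed expert."""

    def __init__(self, d_model, d_hidden, num_experts, seed=0):
        rng = random.Random(seed)
        self.dense_mlp = make_mlp(rng, d_model, d_hidden)
        self.experts = [make_mlp(rng, d_model, d_hidden) for _ in range(num_experts)]
        # One gate row per expert; the highest logit selects the expert.
        self.gate = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                     for _ in range(num_experts)]

    def route(self, x):
        """Return the index of the expert with the highest gate logit (top-1)."""
        logits = matvec(self.gate, x)
        return max(range(len(logits)), key=logits.__getitem__)

    def forward(self, x):
        """Add the routed expert's output to the dense MLP output for one token."""
        expert_id = self.route(x)
        dense_out = mlp_forward(self.dense_mlp, x)
        expert_out = mlp_forward(self.experts[expert_id], x)
        return [d + e for d, e in zip(dense_out, expert_out)], expert_id


if __name__ == "__main__":
    # Assumed toy configuration: 6 MoE layers, 32 experts early, 64 in the last two.
    counts = pyramid_expert_counts(6, base_experts=32, late_experts=64, num_late_layers=2)
    layers = [ResidualMoELayer(d_model=4, d_hidden=8, num_experts=n, seed=i)
              for i, n in enumerate(counts)]
    token = [0.5, -1.0, 0.25, 2.0]
    for layer in layers:
        token, chosen = layer.forward(token)
    print(counts, token)
```

Because only one expert is activated per token, the routing traffic in this sketch matches top-1 gating, while the always-on dense MLP plays the role of a second, fixed expert.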

Experimental Results

Research Questions

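RQ1: Can MoE be applied effectively to autoregressive NLG to reduce training cost without compromising quality?
RQ2: Does PR-MoE maintain or improve model quality while substantially reducing the parameter count relative to standard MoE?
RQ3: Can knowledge distillation produce smaller MoE models (MoS/PR-MoS) that retain the benefits of MoE and deliver faster inference?
RQ4: How can an end-to-end MoE inference system be designed to deliver low latency and cost at the scale of hundreds to thousands of GPUs?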

Key Results

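MoE models achieve better validation loss than their dense counterparts and can match or exceed the quality of larger dense models at lower training cost (e.g., 1.3B+MoE-128 reaches quality similar to a 6.7B dense model).
Training throughput measurements show a 5x cost reduction for MoE models relative to a large dense baseline of the same quality.
PR-MoE reduces the parameter count by up to 3x at accuracy comparable to standard MoE.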
MoS distillation shrinks the MoE model further, bringing the total size reduction to up to 3.7x while zero-shot performance remains comparable.
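DeepSpeed-MoE inference delivers up to 7.3x lower latency and cost than existing MoE inference solutions and serves trillion-parameter MoE models with ultra-low latency under 25 ms.
The PR-MoE/MoS combination achieves strong parameter efficiency with almost no quality loss relative to the large MoE baseline.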
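
The MoS results above rest on staged knowledge distillation. The review does not spell out the loss, so the following is only a sketch under stated assumptions: the student's loss is taken to be cross-entropy on the true next token plus a weighted KL term toward the teacher's distribution, and "staged" is read as switching the KL term off after a chosen step. The names `staged_kd_loss`, `kd_stop_step`, and `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_id):
    """Negative log-likelihood of the correct token under the student."""
    return -math.log(softmax(student_logits)[target_id])


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two predicted token distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def staged_kd_loss(student_logits, teacher_logits, target_id,
                   step, kd_stop_step, alpha=1.0):
    """Assumed staged schedule: the KD term is active only while step < kd_stop_step."""
    loss = cross_entropy(student_logits, target_id)
    if step < kd_stop_step:
        loss += alpha * kl_divergence(teacher_logits, student_logits)
    return loss


if __name__ == "__main__":
    student = [0.2, 0.1, 1.5]   # toy student logits over a 3-token vocabulary
    teacher = [0.0, 0.0, 3.0]   # toy teacher logits
    early = staged_kd_loss(student, teacher, target_id=2, step=10, kd_stop_step=100)
    late = staged_kd_loss(student, teacher, target_id=2, step=500, kd_stop_step=100)
    print(early, late)
```

In this sketch the student keeps its own sparse MoE structure; only the training signal changes, so the inference-time benefits of sparsity are unaffected by the distillation term.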

This review was generated by AI and reviewed by a human editor.