QUICK REVIEW

[Paper Review] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng|arXiv (Cornell University)|2024. 01. 11.

Topic Modeling | Citations: 16

One-Line Summary

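DeepSeekMoE achieves highly specialized MoE experts through fine-grained expert segmentation and shared expert isolation, improving efficiency and performance over conventional MoE architectures, and demonstrates this at scales from 2B to 145B parameters.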

ABSTRACT

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts; (2) isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which set the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.

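Motivation and Objectives

Identify and address knowledge hybridity and knowledge redundancy in conventional MoE architectures.
Propose DeepSeekMoE, which improves expert specialization without increasing total parameters or computation.
Demonstrate scalability from 2B to 145B parameters with competitive or superior performance.
Verify whether fine-grained segmentation and shared expert isolation reduce redundancy and improve efficiency.
Show that supervised fine-tuning can align the model for chat settings and public release.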

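Proposed Method

Introduce two core strategies: (i) fine-grained expert segmentation, which splits the FFN intermediate hidden dimension so that each expert becomes m smaller experts (mN in total) and activates mK of them at constant computational cost; (ii) shared expert isolation, which keeps K_s shared experts always active to consolidate common knowledge.
Formalize the MoE layer with fine-grained segmentation and shared experts while keeping the total parameter count unchanged.
Incorporate expert-level and device-level load-balancing losses to mitigate routing collapse and distribute computation across devices.
Train 2B-parameter MoE variants on about 100B tokens of a multilingual corpus and compare them against Hash Layer, Switch Transformer, and GShard baselines.
Run scaling experiments at 2B, 16B, and 145B parameters and evaluate performance against dense models and larger MoE baselines.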
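
The two strategies and the expert-level balance loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `MoELayerSketch`, `make_ffn`, and `expert_balance_loss` are hypothetical, the experts are tiny ReLU FFNs, routing scores are a softmax over dot products with per-expert routing vectors, the K_s shared experts are assumed to be taken out of the pool of mN experts (so mK - K_s routed experts are selected per token), and the loss uses a common auxiliary form, alpha * sum_i f_i * P_i, which the review itself does not spell out.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_ffn(d_model, d_hidden, rng):
    """Build a tiny two-layer ReLU FFN as a closure (illustrative expert)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def ffn(x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_in]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

    ffn.n_params = 2 * d_model * d_hidden  # weights only; biases omitted
    return ffn


class MoELayerSketch:
    """Hypothetical toy layer: K_s shared experts plus top-(mK - K_s) routed ones."""

    def __init__(self, d_model, d_ffn, n_experts, top_k, m, k_shared, seed=0):
        assert d_ffn % m == 0 and 0 <= k_shared < m * top_k
        rng = random.Random(seed)
        d_small = d_ffn // m                       # each expert is m times smaller
        self.n_routed = m * n_experts - k_shared   # mN experts, K_s set aside (assumption)
        self.k_routed = m * top_k - k_shared       # keeps the active count at mK
        self.shared = [make_ffn(d_model, d_small, rng) for _ in range(k_shared)]
        self.routed = [make_ffn(d_model, d_small, rng) for _ in range(self.n_routed)]
        # One routing vector per routed expert; its dot product with the
        # token gives the affinity logit.
        self.centroids = [[rng.gauss(0, 1) for _ in range(d_model)]
                          for _ in range(self.n_routed)]

    def total_params(self):
        return sum(e.n_params for e in self.shared + self.routed)

    def active_params(self):
        # All experts have the same size, so any k_routed routed experts
        # give the same per-token count.
        experts = self.shared + self.routed[:self.k_routed]
        return sum(e.n_params for e in experts)

    def forward(self, x):
        """Return (output, routing scores, indices of selected routed experts)."""
        logits = [sum(c * v for c, v in zip(row, x)) for row in self.centroids]
        scores = softmax(logits)
        top = sorted(range(self.n_routed), key=scores.__getitem__, reverse=True)
        top = top[:self.k_routed]
        out = list(x)                              # residual connection
        for expert in self.shared:                 # always active, no gate
            out = [o + y for o, y in zip(out, expert(x))]
        for i in top:                              # gated by routing score
            out = [o + scores[i] * y for o, y in zip(out, self.routed[i](x))]
        return out, scores, top


def expert_balance_loss(all_scores, all_top, k_routed, alpha=0.01):
    """Assumed auxiliary loss: alpha * sum_i f_i * P_i over a batch of tokens."""
    n_tokens, n_routed = len(all_scores), len(all_scores[0])
    loss = 0.0
    for i in range(n_routed):
        picked = sum(1 for top in all_top if i in top)
        f_i = n_routed * picked / (k_routed * n_tokens)  # selection frequency
        p_i = sum(s[i] for s in all_scores) / n_tokens    # mean routing score
        loss += f_i * p_i
    return alpha * loss
```

With illustrative settings such as `MoELayerSketch(d_model=8, d_ffn=16, n_experts=4, top_k=2, m=4, k_shared=1)`, `total_params()` and `active_params()` come out equal to those of a conventional layer with 4 full-size experts and top-2 routing, which is the "same parameters, same computation" constraint the method relies on.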

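Experimental Results

Research Questions

RQ1: Can fine-grained expert segmentation increase combinatorial routing flexibility and improve specialization without increasing total parameters or computation?
RQ2: Does isolating a small set of shared experts reduce redundancy and improve the parameter efficiency of MoE models?
RQ3: How does DeepSeekMoE perform on standard NLP benchmarks relative to GShard and dense baselines at the 2B, 16B, and 145B scales?
RQ4: How do the routing balance losses affect training stability and model performance?
RQ5: Can DeepSeekMoE approach the upper bound of MoE performance and enable a publicly releasable 16B model with practical GPU requirements?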
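
To make the "combinatorial routing flexibility" in RQ1 concrete, the short sketch below counts how many distinct sets of experts a token can be routed to before and after segmentation. The function name `routing_combinations` and the values N = 16, K = 2, m = 4 are illustrative assumptions, not figures taken from this review.

```python
from math import comb


def routing_combinations(n_experts, top_k, m=1):
    """Number of distinct expert subsets when choosing m*K out of m*N experts."""
    return comb(m * n_experts, m * top_k)


# Conventional top-2 routing over 16 experts.
print(routing_combinations(16, 2))        # 120
# Each expert split into 4 smaller ones: choose 8 out of 64.
print(routing_combinations(16, 2, m=4))   # 4426165368
```

The per-token computation is the same in both cases, because each of the 8 selected experts is a quarter of the original size; only the number of possible expert combinations grows.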

Key Results

Metric	# Shot	Dense	Hash Layer	Switch	GShard	DeepSeekMoE
Pile (Loss)	N/A	2.060	1.932	1.881	1.808	1.808
HellaSwag (Acc.)	0-shot	54.8	50.5	49.1	54.8	54.8
PIQA (Acc.)	0-shot	72.3	70.6	70.5	72.3	72.3
ARC-easy (Acc.)	0-shot	49.4	43.9	45.9	49.4	49.4
ARC-challenge (Acc.)	0-shot	34.3	31.6	30.2	34.3	34.3
RACE-middle (Acc.)	5-shot	44.0	42.1	43.6	44.0	44.0
RACE-high (Acc.)	5-shot	31.7	30.4	30.9	31.7	31.7
HumanEval (Pass@1)	0-shot	4.9	3.7	2.4	4.9	4.9
MBPP (Pass@1)	3-shot	2.2	0.2	0.4	2.2	2.2
TriviaQA (EM)	5-shot	16.6	10.2	8.9	16.6	16.6
NaturalQuestions (EM)	5-shot	5.7	3.2	2.5	5.7	5.7

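DeepSeekMoE 2B outperforms GShard 2B, which has the same total parameter count, by a substantial margin, and achieves performance comparable to GShard 2.9B, which has 1.5 times the expert parameters and computation.
DeepSeekMoE 2B nearly matches the performance of a dense model with the same total parameter count, suggesting that it can approach the upper bound of MoE performance.
DeepSeekMoE 16B achieves performance comparable to DeepSeek 7B and LLaMA2 7B with only about 40% of the computation, and outperforms models with a similar number of activated parameters.
At 145B, DeepSeekMoE shows substantial advantages over GShard and approaches the performance of DeepSeek 67B with 28.5% (potentially 18.2%) of the computation.
Ablation studies confirm that both fine-grained segmentation and shared expert isolation contribute to better performance and stronger expert specialization.
Analysis suggests that DeepSeekMoE has lower redundancy among routed experts and that the shared experts cannot be replaced by routed experts.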

This review was generated by AI and reviewed by a human editor.