QUICK REVIEW

[Paper Review] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Liu, Aixin|arXiv (Cornell University)|2024. 05. 07.

Expert finding and Q&A systems | Citations: 97

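One-Line Summary

DeepSeek-V2 is an open-source MoE language model with 236B total parameters, of which 21B are activated per token, and a 128K-token context length. Its new MLA and DeepSeekMoE architectures enable economical training and efficient inference, and the model achieves top-tier performance among open-source models.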

ABSTRACT

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

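Research Motivation and Goals

Address the resource and efficiency problems of large language models through economical training and fast inference.
Develop an architecture that shrinks the KV cache and enables scalable MoE training.
Achieve strong performance on English and Chinese benchmarks while reducing training cost and improving inference throughput.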

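Proposed Method

Introduce Multi-head Latent Attention (MLA), which uses low-rank key-value joint compression to reduce the KV cache during inference.
Adopt DeepSeekMoE for the FFNs, using sparse routing and fine-grained experts to train strong models at an economical cost.
Use decoupled rotary position embeddings so that MLA stays compatible with RoPE.
Implement device-limited routing, auxiliary load-balancing losses, and a token-dropping strategy to control communication and computation in the MoE layers.
Pretrain on an 8.1T-token multi-source corpus, then align the model with Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) using Group Relative Policy Optimization (GRPO).
Extend the context length to 128K with long-context extension based on YaRN.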
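
The low-rank key-value joint compression behind MLA can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the random weights, and the names `W_down_kv`, `W_up_k`, `W_up_v`, `compress`, and `expand` are all hypothetical, and the decoupled RoPE key, which would also need to be cached, is left out for brevity. The point is only that the cache stores one small latent vector per token, from which keys and values are re-expanded when attention is computed.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights (stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes, chosen only for illustration (not the model's real dimensions).
D_MODEL = 16   # size of the token's hidden state
N_HEADS = 4    # number of attention heads
D_HEAD = 4     # per-head dimension
D_LATENT = 6   # size of the compressed KV latent

rng = random.Random(0)
W_down_kv = random_matrix(D_LATENT, D_MODEL, rng)        # joint down-projection
W_up_k = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # up-projection to keys
W_up_v = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # up-projection to values

def compress(hidden):
    """Compress a hidden state into the latent vector that would be cached."""
    return matvec(W_down_kv, hidden)

def expand(latent):
    """Recover per-head keys and values (flattened) from the cached latent."""
    return matvec(W_up_k, latent), matvec(W_up_v, latent)

hidden = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
latent = compress(hidden)
keys, values = expand(latent)

# Standard multi-head attention would cache keys and values for every head;
# the compressed variant caches only the latent vector.
mha_cache_per_token = len(keys) + len(values)
latent_cache_per_token = len(latent)
cache_saving = 1 - latent_cache_per_token / mha_cache_per_token
print(mha_cache_per_token, latent_cache_per_token, round(cache_saving, 4))
```

With these toy sizes the per-token cache drops from 32 values to 6. The saving in a real model depends on its actual dimensions and on the extra cached RoPE key, so this ratio should not be read as the paper's reported reduction.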

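Experimental Results

Research Questions

RQ1: How does MLA compare with standard MHA, GQA, and MQA in terms of performance and KV cache efficiency?
RQ2: Compared with dense equivalents and other MoE architectures, can DeepSeekMoE deliver strong model performance at a lower training cost?
RQ3: How does DeepSeek-V2 perform on English and Chinese benchmarks relative to open-source models with a similar number of activated parameters?
RQ4: How do SFT and RL alignment affect the performance of DeepSeek-V2 Chat on English and Chinese tasks?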

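Key Results

DeepSeek-V2 achieves top-tier performance among open-source models with only 21B activated parameters.
Compared with DeepSeek 67B, it saves 42.5% of training costs, reduces the KV cache by 93.3%, and raises the maximum generation throughput to 5.76 times.
The model has 236B total parameters, of which 21B are activated per token, and supports a context length of 128K.
DeepSeek-V2 Chat (RL) achieves strong scores: a length-controlled win rate of 38.9 on AlpacaEval 2.0, 8.97 on MT-Bench, and 7.91 on AlignBench.
On Chinese benchmarks, DeepSeek-V2 Chat (RL) outperforms open-source models and many closed-source models on AlignBench.
DeepSeek-V2-Lite (15.7B total parameters, 2.4B activated) is released to the community.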
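
The headline figures above can be turned into relative quantities with a few lines of arithmetic. The sketch below uses only numbers reported in this review; the function name `summarize_efficiency` and the dictionary keys are hypothetical and exist only for illustration.

```python
def summarize_efficiency(total_params_b, active_params_b,
                         training_cost_saving, kv_cache_reduction):
    """Derive relative quantities from the reported headline figures."""
    remaining_kv = 1.0 - kv_cache_reduction
    return {
        # Share of parameters used for each token.
        "active_fraction": active_params_b / total_params_b,
        # Training cost relative to the DeepSeek 67B baseline.
        "relative_training_cost": 1.0 - training_cost_saving,
        # KV cache size relative to the DeepSeek 67B baseline.
        "relative_kv_cache": remaining_kv,
        # How many times smaller the KV cache is than the baseline.
        "kv_compression_factor": 1.0 / remaining_kv,
    }

# Figures as reported: 236B total, 21B activated, 42.5% saving, 93.3% reduction.
summary = summarize_efficiency(236.0, 21.0, 0.425, 0.933)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these figures, roughly 9% of the parameters are active for each token, and the KV cache shrinks to about 6.7% of the DeepSeek 67B baseline, which is close to a 15x reduction.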

This review was generated by AI and reviewed by a human editor.