QUICK REVIEW

[Paper Review] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Liu, Aixin | arXiv (Cornell University) | May 7, 2024

Expert finding and Q&A systems | Cited by 97

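One-Sentence Summary

DeepSeek-V2 is an open-source MoE language model with 236B total parameters, of which 21B are activated per token. It supports a 128K context length and introduces the MLA and DeepSeekMoE architectures, which enable economical training and efficient inference while reaching top-tier performance among open-source models.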

ABSTRACT

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

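Research Motivation and Objectives

Address the resource and efficiency challenges of large language models through economical training and fast inference.
Develop architectures that shrink the KV cache and enable scalable MoE training.
Achieve strong performance on English and Chinese benchmarks while lowering training cost and raising inference throughput.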

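Proposed Method

Introduce Multi-head Latent Attention (MLA), which uses low-rank joint compression of keys and values to reduce the KV cache at inference time.
Adopt DeepSeekMoE for the FFN layers, so that sparse routing and fine-grained experts make it possible to train strong models at an economical cost.
Use decoupled rotary position embeddings to keep RoPE compatible with MLA.
Apply device-limited routing, auxiliary load-balancing losses, and a token-dropping strategy to control MoE communication and computation.
Pretrain on an 8.1T-token multi-source corpus, then align the model with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) using Group Relative Policy Optimization (GRPO).
Extend the context length to 128K with YaRN.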
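
The sparse-routing idea behind the MoE FFN can be illustrated with a minimal sketch. It assumes a softmax gate followed by top-k selection for a single token. The function names, the toy scaling "experts", and the logits are hypothetical and chosen only for illustration. Device-limited routing, the balance losses, and token dropping are not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(affinity_logits, top_k):
    """Return {expert_index: gate_weight} for the top_k experts of one token.

    Assumption: the gate weight is the softmax probability of each selected
    expert, and all unselected experts get weight zero.
    """
    probs = softmax(affinity_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return {i: probs[i] for i in ranked[:top_k]}


def sparse_moe_ffn(x, experts, affinity_logits, top_k):
    """Combine only the selected experts' outputs, weighted by their gates."""
    gates = route_token(affinity_logits, top_k)
    out = [0.0] * len(x)
    for idx, gate in gates.items():
        y = experts[idx](x)  # only top_k experts are ever evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, gates


def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert FFN: multiplies the input by a constant."""
    return lambda x: [scale * v for v in x]


# Toy setup: 4 experts, 2 activated per token (illustrative values only).
toy_experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
toy_logits = [0.1, 2.0, -1.0, 1.5]
toy_output, toy_gates = sparse_moe_ffn([1.0, -2.0], toy_experts, toy_logits, top_k=2)
```

Because only `top_k` experts run per token, compute per token scales with the activated parameters rather than the total parameter count, which is the property the method relies on.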

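Experimental Results

Research Questions

RQ1: How does MLA compare with standard MHA, GQA, and MQA in performance and KV cache efficiency?
RQ2: Compared with dense equivalents or other MoE architectures, can DeepSeekMoE achieve strong model performance at lower training cost?
RQ3: How does DeepSeek-V2 perform on English and Chinese benchmarks relative to open-source baselines with a similar number of activated parameters?
RQ4: How do Supervised Fine-Tuning (SFT) and RL alignment affect DeepSeek-V2 Chat's performance on English and Chinese tasks?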
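
To make the KV cache side of RQ1 concrete, the sketch below counts cached elements per token for each attention variant. It assumes MHA caches full keys and values for every head, GQA caches them once per group, MQA caches a single key/value head, and MLA caches one compressed latent vector plus one decoupled RoPE key per layer. The layer count, head count, and dimensions are illustrative placeholders and do not come from this summary.

```python
def kv_cache_elements_per_token(scheme, n_layers, n_heads, head_dim,
                                n_groups=1, latent_dim=0, rope_dim=0):
    """Number of cached elements per token, summed over all layers."""
    if scheme == "MHA":
        per_layer = 2 * n_heads * head_dim      # keys + values for every head
    elif scheme == "GQA":
        per_layer = 2 * n_groups * head_dim     # keys + values per group
    elif scheme == "MQA":
        per_layer = 2 * head_dim                # one shared key/value head
    elif scheme == "MLA":
        per_layer = latent_dim + rope_dim       # latent vector + decoupled RoPE key
    else:
        raise ValueError(f"unknown attention scheme: {scheme}")
    return per_layer * n_layers


# Illustrative configuration (assumed values, for comparison only).
config = dict(n_layers=60, n_heads=128, head_dim=128,
              n_groups=8, latent_dim=512, rope_dim=64)

cache_sizes = {
    scheme: kv_cache_elements_per_token(scheme, **config)
    for scheme in ("MHA", "GQA", "MQA", "MLA")
}

for scheme, size in cache_sizes.items():
    ratio = size / cache_sizes["MHA"]
    print(f"{scheme}: {size} elements per token ({ratio:.2%} of MHA)")
```

Under these assumed dimensions the MLA cache sits between MQA and GQA in size while staying far below MHA. Whether quality matches MHA is an empirical question that the cache arithmetic alone cannot answer.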

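Key Findings

With only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models.
Compared with DeepSeek 67B, it saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts maximum generation throughput to 5.76 times.
The model has 236B total parameters, activates 21B per token, and supports a 128K context length.
DeepSeek-V2 Chat (RL) reaches a 38.9 length-controlled win rate on AlpacaEval 2.0, 8.97 on MT-Bench, and 7.91 on AlignBench.
On Chinese benchmarks, DeepSeek-V2 Chat (RL) outperforms open-source models and many closed-source models on AlignBench.
DeepSeek-V2-Lite (15.7B total parameters, 2.4B activated) has been released to the community.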
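
The headline figures above are easier to compare once converted into ratios. The short sketch below does only that arithmetic on the numbers reported in this summary. The function and key names are hypothetical, and the derived values are simple consequences of the stated percentages, not additional results from the paper.

```python
def headline_ratios():
    """Derive simple ratios from the figures quoted in the summary."""
    total_params_b, active_params_b = 236, 21        # DeepSeek-V2, in billions
    lite_total_b, lite_active_b = 15.7, 2.4          # DeepSeek-V2-Lite, in billions
    kv_cache_reduction = 0.933                       # relative to DeepSeek 67B
    training_cost_saving = 0.425                     # relative to DeepSeek 67B

    return {
        # Share of parameters that is active for any single token.
        "active_fraction": active_params_b / total_params_b,
        "lite_active_fraction": lite_active_b / lite_total_b,
        # A 93.3% reduction leaves 6.7% of the original cache.
        "kv_cache_remaining": 1 - kv_cache_reduction,
        "kv_cache_shrink_factor": 1 / (1 - kv_cache_reduction),
        # A 42.5% saving leaves 57.5% of the original training cost.
        "training_cost_remaining": 1 - training_cost_saving,
    }


ratios = headline_ratios()
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Read this way, fewer than one in ten parameters is active per token, and the reported KV cache reduction corresponds to a cache roughly fifteen times smaller than that of DeepSeek 67B.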

This review was generated by AI and checked by human editors.