QUICK REVIEW

[Paper Review] DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu|arXiv (Cornell University)|Dec 27, 2024

Distributed and Parallel Computing Systems | Cited by 206

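One-Sentence Summary

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts language model that activates 37B parameters per token, uses Multi-head Latent Attention (MLA) and auxiliary-loss-free load balancing, and was trained on 14.8T tokens with FP8 mixed precision; it outperforms other open-source models and is competitive with leading closed-source models.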

ABSTRACT

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
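
To make the "671B total, 37B activated" figures concrete, the short sketch below computes the share of parameters that a single token touches. The constants come straight from the abstract; the function name is only illustrative.

```python
# Parameter counts quoted in the abstract, in billions.
TOTAL_PARAMS_B = 671
ACTIVE_PARAMS_B = 37


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the fraction of all parameters used to process one token."""
    return active_b / total_b


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Parameters active per token: {frac:.1%} of the model")
```

Only a small fraction of the weights participates in each forward pass, which is why an MoE model of this size can cost far less compute per token than a dense model with the same total parameter count.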

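Motivation and Objectives

Advance the capabilities of open-source LLMs with a large Mixture-of-Experts architecture.
Improve training efficiency and stability through FP8 mixed-precision training and DualPipe pipeline parallelism.
Improve inference efficiency through Multi-head Latent Attention and optimized cross-node communication.
Introduce an auxiliary-loss-free load-balancing strategy and a Multi-Token Prediction objective to improve performance.
Extend the context length and apply post-training (SFT and RL) to align the model with human preferences.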
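
The inference-efficiency goal listed above for MLA comes largely from a smaller key/value cache. The sketch below is rough arithmetic, assuming a latent-caching scheme in which each layer stores one compressed KV vector plus a small shared positional key per token instead of full per-head keys and values. All four dimensions are illustrative assumptions, not figures taken from this review.

```python
def mha_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer by standard multi-head attention:
    one key vector and one value vector for every head."""
    return 2 * n_heads * head_dim


def latent_cache_per_token(latent_dim: int, rope_dim: int) -> int:
    """Elements cached per token per layer when only a compressed KV latent
    and a small shared positional key are stored."""
    return latent_dim + rope_dim


# Hypothetical dimensions, chosen only to illustrate the scale of the saving.
N_HEADS, HEAD_DIM = 128, 128
LATENT_DIM, ROPE_DIM = 512, 64

if __name__ == "__main__":
    full = mha_cache_per_token(N_HEADS, HEAD_DIM)
    latent = latent_cache_per_token(LATENT_DIM, ROPE_DIM)
    print(f"standard cache: {full} elements per token per layer")
    print(f"latent cache:   {latent} elements per token per layer")
    print(f"reduction:      {full / latent:.1f}x")
```

Under these assumed sizes the cache shrinks by well over an order of magnitude, which translates into longer contexts or larger batches for the same memory budget.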

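Proposed Method

Use Multi-head Latent Attention (MLA) to shrink the KV cache at inference time without degrading performance.
Use the DeepSeekMoE architecture with an auxiliary-loss-free load-balancing strategy to keep expert utilization balanced.
Introduce a Multi-Token Prediction (MTP) objective, which densifies the training signal and can be used for speculative decoding.
Implement FP8 mixed-precision training with tile-wise and block-wise quantization, together with memory-saving techniques such as recomputing RMSNorm and the MLA up-projections.
Develop DualPipe pipeline parallelism and optimized cross-node all-to-all kernels, which hide communication overhead and enable fine-grained expert parallelism.
Design the cross-node communication strategy to use both InfiniBand and NVLink, balancing bandwidth and latency.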
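
The auxiliary-loss-free load balancing mentioned in the list above can be illustrated with a toy simulation. This is a minimal sketch, assuming the bias-based formulation: each expert carries a bias that is added to its routing score only when choosing the top-k experts, and after every step the bias is lowered for overloaded experts and raised for underloaded ones. The skewed score generator, the expert count, and the step size `gamma` are hypothetical values picked for the demo.

```python
import random


def route_top_k(scores, bias, k):
    """Select k experts by biased score and return {expert: gate weight}.
    The bias influences selection only; gate weights use the raw scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - gamma)
        elif l < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_experts=8, k=2, tokens_per_step=512, steps=200,
             gamma=0.01, seed=0):
    """Run the toy router and return (per-step loads, final biases)."""
    rng = random.Random(seed)
    # Hypothetical skew: low-index experts tend to receive higher scores.
    skew = [1.0 + 0.5 * (num_experts - i) / num_experts
            for i in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() * skew[i] for i in range(num_experts)]
            for expert in route_top_k(scores, bias, k):
                load[expert] += 1
        history.append(load)
        bias = update_bias(bias, load, gamma)
    return history, bias


if __name__ == "__main__":
    history, bias = simulate()
    print("first step load:", history[0])
    print("last step load: ", history[-1])
    print("final biases:   ", [round(b, 2) for b in bias])
```

Because the bias never enters the gate weights or the loss, balancing pressure is applied without adding a gradient term that competes with the language-modeling objective, which is the motivation the report gives for dropping the auxiliary loss.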

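Experimental Results

Research Questions

RQ1: What gains in inference and training efficiency do MLA and DeepSeekMoE provide at scale?
RQ2: How does the auxiliary-loss-free load-balancing strategy affect model performance and expert utilization compared with a conventional auxiliary loss?
RQ3: Does the Multi-Token Prediction objective improve the training signal and downstream task performance?
RQ4: How do FP8 training and the DualPipe framework affect efficiency and stability for a model of this scale?
RQ5: How does DeepSeek-V3 perform on standard benchmarks (code, math, reasoning) relative to open-source and closed-source models?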

Key Findings

Training Costs	Pre-Training	Context Extension	Post-Training	Total
H800 GPU Hours	2664K	119K	5K	2788K
USD	$5.328M	$0.238M	$0.01M	$5.576M
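
The dollar row of the table follows from the GPU-hour row at a flat rental price. The sketch below reproduces that arithmetic, assuming $2 per H800 GPU hour, the rate implied by $5.328M / 2664K hours. It also divides the pre-training hours by the 14.8T training tokens quoted in the abstract. The variable and function names are illustrative.

```python
# GPU hours per stage, in thousands, as listed in the table.
GPU_HOURS_K = {
    "pre_training": 2664,
    "context_extension": 119,
    "post_training": 5,
}
# Assumed rental price, implied by the table ($5.328M / 2664K hours).
PRICE_PER_GPU_HOUR_USD = 2.0
# Pre-training corpus size from the abstract, in trillions of tokens.
TOKENS_TRILLIONS = 14.8


def cost_usd_millions(gpu_hours_k: float, price_per_hour: float) -> float:
    """Convert thousands of GPU hours into millions of US dollars."""
    return gpu_hours_k * 1_000 * price_per_hour / 1_000_000


TOTAL_GPU_HOURS_K = sum(GPU_HOURS_K.values())

if __name__ == "__main__":
    for stage, hours_k in GPU_HOURS_K.items():
        cost = cost_usd_millions(hours_k, PRICE_PER_GPU_HOUR_USD)
        print(f"{stage}: {hours_k}K GPU hours, ${cost:.3f}M")
    total_cost = cost_usd_millions(TOTAL_GPU_HOURS_K, PRICE_PER_GPU_HOUR_USD)
    print(f"total: {TOTAL_GPU_HOURS_K}K GPU hours, ${total_cost:.3f}M")
    per_trillion = GPU_HOURS_K["pre_training"] / TOKENS_TRILLIONS
    print(f"pre-training: {per_trillion:.0f}K GPU hours per trillion tokens")
```

Note that these figures cover GPU rental for the listed stages only; the table does not account for anything beyond them.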

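DeepSeek-V3 Base outperforms other open-source base models on code and math benchmarks and approaches leading closed-source models on several tasks.
It scores 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA, which is comparable to GPT-4o and Claude-Sonnet-3.5 on selected benchmarks.
On factual knowledge, it surpasses its open-source peers on SimpleQA and Chinese SimpleQA and is particularly strong on Chinese factual knowledge.
On math benchmarks, it reaches state-of-the-art results among non-long-CoT models, and on some tasks (such as MATH-500) it even surpasses certain long-CoT baselines.
On coding, it is the top-performing model on LiveCodeBench, indicating strong coding ability; on engineering-oriented benchmarks it is also competitive with Claude-Sonnet-3.5.
Training is highly economical (2.788M H800 GPU hours in total) and stable, with no irrecoverable loss spikes or rollbacks.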

This review was generated by AI and reviewed by a human editor.