QUICK REVIEW

[Paper Review] DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu|arXiv (Cornell University)|Dec 27, 2024

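Distributed and Parallel Computing Systems|Citations: 206

One-line summary

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts language model that activates 37B parameters per token, combines Multi-head Latent Attention with auxiliary-loss-free load balancing, and is trained in FP8 on 14.8T tokens. It outperforms other open-source models and reaches performance comparable to leading closed-source models.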

ABSTRACT

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

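Motivation and objectives

Advance the capabilities of open-source LLMs using a large-scale Mixture-of-Experts architecture.
Improve training efficiency and stability through FP8 mixed precision and DualPipe pipeline parallelism.
Improve inference efficiency through Multi-head Latent Attention and cross-node communication optimization.
Improve performance by introducing an auxiliary-loss-free load balancing strategy and a Multi-Token Prediction objective.
Extend the context length and apply post-training (SFT and RL) to align the model with human preferences.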

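Proposed method

Apply Multi-head Latent Attention (MLA) to shrink the inference-time KV cache while maintaining performance.
Use the DeepSeekMoE architecture with an auxiliary-loss-free load balancing strategy to keep expert utilization balanced.
Introduce a Multi-Token Prediction (MTP) objective, which densifies the training signal and can support speculative decoding.
Implement FP8 mixed-precision training with tile-wise and block-wise quantization, together with memory-saving techniques such as recomputing RMSNorm and the MLA up-projections.
Develop DualPipe pipeline parallelism and optimized cross-node all-to-all kernels to hide communication overhead and enable fine-grained expert parallelism.
Adopt a cross-node communication strategy that uses InfiniBand and NVLink to balance bandwidth and latency.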
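The auxiliary-loss-free load balancing idea can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes a per-expert bias that is added to the routing scores only when choosing the top-k experts, while the gate weights are still computed from the unbiased scores, and it nudges each bias by a fixed step after every batch (down for overloaded experts, up for underloaded ones). The function names, the sigmoid scores, the step size `gamma`, and the synthetic `skew` that makes some experts more popular are all hypothetical choices made for illustration.

```python
import numpy as np


def route_with_bias(scores, bias, k):
    """Pick top-k experts per token from biased scores; gate with unbiased scores."""
    biased = scores + bias                      # the bias influences selection only
    topk = np.argsort(-biased, axis=1)[:, :k]   # (tokens, k) expert indices
    picked = np.take_along_axis(scores, topk, axis=1)
    gates = picked / picked.sum(axis=1, keepdims=True)  # normalize over selected experts
    return topk, gates


def update_bias(bias, topk, num_experts, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(topk.ravel(), minlength=num_experts)
    new_bias = bias - gamma * np.sign(load - load.mean())
    return new_bias, load


def simulate(steps=200, tokens=512, num_experts=8, k=2, gamma=0.01, seed=0):
    """Run routing on synthetic, deliberately skewed scores (illustrative only)."""
    rng = np.random.default_rng(seed)
    skew = np.linspace(-1.0, 1.0, num_experts)  # hypothetical expert popularity
    bias = np.zeros(num_experts)
    first_load = last_load = None
    for step in range(steps):
        logits = rng.normal(size=(tokens, num_experts)) + skew
        scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid affinity scores (assumed)
        topk, _ = route_with_bias(scores, bias, k)
        bias, load = update_bias(bias, topk, num_experts, gamma)
        if step == 0:
            first_load = load
        last_load = load
    return first_load, last_load, bias


if __name__ == "__main__":
    first, last, bias = simulate()
    print("load at first step:", first)
    print("load at last step: ", last)
    print("learned biases:    ", np.round(bias, 2))
```

Because the bias never enters the gate weights, balancing in this sketch changes which experts are selected without adding a loss term that competes with the language-modeling objective, which is the property the strategy is named for.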

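Experimental results

Research questions

RQ1: How much do MLA and DeepSeekMoE improve inference and training efficiency at scale?
RQ2: How does the auxiliary-loss-free load balancing strategy affect model performance and expert utilization compared with a conventional auxiliary loss?
RQ3: Does the Multi-Token Prediction objective improve the training signal and downstream task performance?
RQ4: How do FP8 training and the DualPipe framework affect efficiency and stability for a model of this scale?
RQ5: How does DeepSeek-V3 perform on standard benchmarks (code, math, reasoning) compared with open-source and closed-source models?

Key findings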

Training Costs	Pre-Training	Context Extension	Post-Training	Total
H800 GPU hours	2664K	119K	5K	2788K
USD	$5.328M	$0.238M	$0.01M	$5.576M
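The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below only re-derives numbers from values already stated in this review: the dollar cost divided by the GPU hours gives the rate implied for each stage (a flat rate of about $2 per H800 GPU hour), the stage values sum to the totals, and dividing the pre-training GPU hours by the 14.8T pre-training tokens gives about 180K GPU hours per trillion tokens. The variable and function names are hypothetical.

```python
# Values copied from the training-cost table (GPU hours in thousands, USD in millions).
GPU_HOURS_K = {"pre_training": 2664, "context_extension": 119, "post_training": 5}
USD_M = {"pre_training": 5.328, "context_extension": 0.238, "post_training": 0.01}
PRETRAIN_TOKENS_T = 14.8  # trillions of pre-training tokens, from the abstract


def implied_rate(stage):
    """USD per H800 GPU hour implied by the table for one stage."""
    return (USD_M[stage] * 1e6) / (GPU_HOURS_K[stage] * 1e3)


total_gpu_hours_k = sum(GPU_HOURS_K.values())
total_usd_m = sum(USD_M.values())
gpu_hours_per_trillion_tokens_k = GPU_HOURS_K["pre_training"] / PRETRAIN_TOKENS_T

if __name__ == "__main__":
    for stage in GPU_HOURS_K:
        print(f"{stage}: ${implied_rate(stage):.2f} per GPU hour")
    print(f"total: {total_gpu_hours_k}K GPU hours, ${total_usd_m:.3f}M")
    print(f"pre-training: {gpu_hours_per_trillion_tokens_k:.0f}K GPU hours per trillion tokens")
```

Note that the dollar figures cover only the GPU time for the listed stages at the implied rental rate; the table says nothing about other costs.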

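DeepSeek-V3 Base outperforms other open-source base models on code and math benchmarks and approaches some leading closed-source models on several tasks.
It scores 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA, on par with GPT-4o and Claude-Sonnet-3.5 on selected benchmarks.
On factual knowledge, it outperforms comparable open-source models on SimpleQA and Chinese SimpleQA, and is particularly strong on Chinese factual knowledge.
On math benchmarks, it achieves state-of-the-art results among non-long-CoT models and even outperforms some long-CoT baselines on specific tasks (e.g., MATH-500).
On coding tasks, it achieves top performance on LiveCodeBench, demonstrating strong coding ability. On engineering benchmarks overall, it is competitive with Claude-Sonnet-3.5.
Training is notably economical (2.788M total GPU hours) and highly stable, with no irrecoverable loss spikes or rollbacks.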

This review was generated by AI and checked by a human editor.