QUICK REVIEW

[Paper Review] Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts

Xuan-Phi Nguyen, Shrey Pandit|arXiv (Cornell University)|Jan 23, 2026

Mobile Crowdsensing and Crowdsourcing | Cited by: 0

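One-line summary

The paper introduces Least-Loaded Expert Parallelism (LLEP), a dynamic routing scheme that redistributes excess tokens and expert parameters away from overloaded GPUs to even out load in imbalanced MoE models, delivering substantial speedups and memory savings over standard Expert Parallelism (EP).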

ABSTRACT

Mixture-of-Experts (MoE) models are typically pre-trained with explicit load-balancing constraints to ensure statistically balanced expert routing. Despite this, we observe that even well-trained MoE models exhibit significantly imbalanced routing. This behavior is arguably natural, and even desirable, as imbalanced routing allows models to concentrate domain-specific knowledge within a subset of experts. Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices, but with a less-discussed assumption of balanced routing. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures on overloaded devices during post-training or inference, where explicit load balancing is often inapplicable. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones. This ensures that all devices complete their workloads within the minimum collective latency while respecting memory constraints. Across different model scales, LLEP achieves up to 5x speedup and 4x reduction in peak memory usage compared to standard EP. This enables faster and higher-throughput post-training and inference, with ~1.9x faster for gpt-oss-120b. We support our method with extensive theoretical analysis and comprehensive empirical evaluations, including ablation studies. These results illuminate key trade-offs and enable a principled framework for hardware-specific hyper-parameter tuning to achieve optimal performance.

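Motivation and objectives

Motivates the problem of naturally imbalanced expert routing in trained MoE models and its impact on EP efficiency.
Proposes LLEP, which dynamically reroutes excess tokens and expert weights to less-loaded devices while respecting memory constraints.
Provides theoretical and empirical analysis of latency and memory under imbalance, and demonstrates practical gains on real models.
Offers guidance on hardware-aware tuning to maximize throughput in post-training and inference scenarios.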
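
To make the imbalance problem concrete, here is a minimal sketch of how skewed routing turns into skewed per-device load under standard EP. The routing counts, the contiguous expert placement, and the helper names `per_device_load` and `imbalance_ratio` are illustrative assumptions, not taken from the paper.

```python
def per_device_load(tokens_per_expert, num_devices):
    """Tokens each device processes when experts are placed in
    contiguous, equal-sized groups (an assumed placement)."""
    num_experts = len(tokens_per_expert)
    assert num_experts % num_devices == 0, "experts must divide evenly"
    per_dev = num_experts // num_devices
    return [
        sum(tokens_per_expert[d * per_dev:(d + 1) * per_dev])
        for d in range(num_devices)
    ]


def imbalance_ratio(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Hypothetical routing counts: 8 experts, 1000 tokens, one hot expert.
skewed = [700, 100, 40, 40, 30, 30, 30, 30]
loads = per_device_load(skewed, num_devices=4)
ratio = imbalance_ratio(loads)
print(loads, ratio)  # [800, 80, 60, 60] 3.2
```

In this toy case one device handles 3.2 times the average load. Since all devices must finish before the layer can proceed, the most loaded device sets the pace, which is the inefficiency the review's first bullet refers to.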

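Proposed method

Problem definition: imbalanced token routing in MoE layers under EP during post-training or inference.
Proposes LLEP, which uses the LLA (least-loaded assignment) algorithm to spill excess tokens from overloaded GPUs over to GPUs that are not overloaded.
Develops LLAS (the spill routine), which transfers the remaining workload and the corresponding weights between GPUs.
Presents the complete LLEP dispatch-combine workflow, including backward-pass support and exact MoE computation.
Provides latency and peak-memory analysis that justifies when and how spilling should occur, and introduces tunable hardware-aware parameters (α, m, λ).
Reports end-to-end and controlled experiments across multiple MoE architectures, showing speedups and memory reductions.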
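
The review does not reproduce the LLA or LLAS pseudocode, so the following is only a generic greedy sketch of the least-loaded spilling idea, not the authors' algorithm. The function name `least_loaded_spill` and the per-device token budget `capacity` are hypothetical, and `capacity` is not claimed to correspond to the paper's α, m, or λ.

```python
import heapq


def least_loaded_spill(tokens_per_expert, expert_to_device, num_devices, capacity):
    """Return (expert, device, num_tokens) chunks such that no device
    exceeds `capacity` tokens. Excess tokens go to the currently
    least-loaded device, which would also need a copy of that
    expert's weights."""
    load = [0] * num_devices
    plan = []
    excess = []  # (expert, tokens that did not fit on the home device)

    # Pass 1: each expert keeps as many tokens as fit on its home device.
    for expert, n in enumerate(tokens_per_expert):
        home = expert_to_device[expert]
        keep = min(n, capacity - load[home])
        if keep > 0:
            plan.append((expert, home, keep))
            load[home] += keep
        if n > keep:
            excess.append((expert, n - keep))

    # Pass 2: spill leftovers to the least-loaded device, tracked in a min-heap.
    heap = [(load[d], d) for d in range(num_devices)]
    heapq.heapify(heap)
    for expert, left in excess:
        while left > 0:
            current, device = heapq.heappop(heap)
            room = capacity - current
            if room <= 0:
                raise ValueError("capacity too small for the total token count")
            take = min(left, room)
            plan.append((expert, device, take))
            left -= take
            heapq.heappush(heap, (current + take, device))
    return plan


tokens = [700, 100, 40, 40, 30, 30, 30, 30]  # hypothetical routing counts
placement = [0, 0, 1, 1, 2, 2, 3, 3]         # expert index -> home device
plan = least_loaded_spill(tokens, placement, num_devices=4, capacity=250)

final_load = [0] * 4
for expert, device, n in plan:
    final_load[device] += n
print(final_load)  # [250, 250, 250, 250]
```

With the budget set to the mean load, the toy example ends up perfectly balanced. A real implementation would also have to weigh the cost of moving expert weights against the compute saved, which is presumably what the latency and memory analysis in the paper addresses.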

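Experimental results

Research questions

RQ1: How does imbalanced routing manifest in state-of-the-art MoE models during pre-training, fine-tuning, and inference?
RQ2: Can a load-aware distributed routing policy reduce per-GPU latency and peak memory without changing MoE behavior?
RQ3: What are the theoretical and empirical cost dynamics of MoE with and without least-loaded routing under imbalance?
RQ4: How do the hyperparameters α, m, and λ affect LLEP performance across model scales and hardware configurations?
RQ5: Do end-to-end deployments (e.g., gpt-oss-20b/120b) benefit from LLEP in throughput and memory stability compared with standard EP?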
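
RQ2 hinges on rerouting leaving the MoE output unchanged. The toy check below illustrates why that is plausible: an expert is applied to each token independently, so splitting one expert's tokens between its home device and a helper device holding a copy of the weights gives the same result as processing them all in one place. The linear `expert_forward` and the integer data are illustrative assumptions; on real hardware, floating-point kernels may round slightly differently for different batch shapes.

```python
def expert_forward(weight, tokens):
    """Toy linear expert: compute y = W x for every token x."""
    return [
        [sum(w * x for w, x in zip(row, token)) for row in weight]
        for token in tokens
    ]


weight = [[1, 2], [3, 4], [5, 6]]            # toy expert mapping 2 -> 3 dims
tokens = [[1, 0], [0, 1], [2, 2], [-1, 3]]   # four tokens routed to this expert

# Standard EP: all four tokens are processed on the expert's home device.
reference = expert_forward(weight, tokens)

# With spilling: two tokens stay home, two go to a helper device
# that received a copy of the same weights.
home_out = expert_forward(weight, tokens[:2])
helper_weight = [row[:] for row in weight]
helper_out = expert_forward(helper_weight, tokens[2:])

# Combine step: put the outputs back in the original token order.
combined = home_out + helper_out
print(combined == reference)  # True
```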

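Key findings

LLEP achieves up to 5× speedup over standard EP under extreme imbalance while keeping memory usage stable.
Peak per-GPU memory with LLEP stays roughly constant across imbalance scenarios, whereas it grows by up to 4× with standard EP.
End-to-end throughput gains on real models reach up to 2.2× for gpt-oss-20b and 1.9× for gpt-oss-120b.
Training with LLEP yields ~1.25× faster convergence than EP, with practical overheads included.
Ablations show that larger batch sizes yield larger speedups and that higher α reduces the speedup, suggesting that equalizing load is preferable at scale.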

This review was generated by AI and checked by a human editor.