QUICK REVIEW

[Paper Review] MoEless: Efficient MoE LLM Serving via Serverless Computing

Hanfei Yu, Bei Ouyang|arXiv (Cornell University)|Mar 6, 2026

IoT and Edge/Fog Computing | Cited by 0

One-line summary

MoEless is the first serverless MoE serving framework: it decouples MoE experts from the model and uses layer-aware prediction, scaling, and placement to reduce latency and cost.

ABSTRACT

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs and advancing model scales, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployment using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, limiting expert scalability and elasticity, and resulting in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.
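The expert load imbalance described in the abstract can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical list of routing decisions (one expert id per token) and uses the ratio of the busiest expert's load to the mean load as an illustrative imbalance measure. The function names `expert_loads` and `imbalance_factor` are invented for this example.

```python
from collections import Counter
from typing import List, Sequence


def expert_loads(routed_experts: Sequence[int], num_experts: int) -> List[int]:
    """Count how many tokens the router sent to each expert in one layer."""
    counts = Counter(routed_experts)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_factor(loads: Sequence[int]) -> float:
    """Ratio of the busiest expert's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


# Hypothetical routing decisions for 8 tokens over 4 experts.
routed = [0, 0, 0, 0, 0, 0, 1, 2]
loads = expert_loads(routed, num_experts=4)
print(loads, imbalance_factor(loads))
```

A factor well above 1.0 means that one expert (a straggler) does far more work than the average expert, which is the situation the paper targets.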

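Motivation and objectives

Mitigate expert load imbalance in MoE serving and reduce inference latency.
Enable elastic expert scalability for MoE models through serverless computing.
Minimize MoE inference cost through serverless execution.
Develop predictors and strategies for accurate load estimation, scaling, and placement.
Demonstrate performance gains on real MoE models and workloads.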

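Proposed method

Decouple the MoE experts from the model and package them as independent serverless functions, while keeping the non-expert modules under data parallelism.
Design lightweight, layer-aware Expert Load Predictors that estimate upcoming expert loads and identify stragglers at the batch level.
Implement Expert Scaling, which adjusts the number of expert replicas based on predicted load and latency targets.
Develop Expert Placement, which assigns replicas to GPUs so as to maximize function locality and GPU utilization.
Run inference with each layer's load spread evenly across replicas, balancing work and eliminating stragglers.
Prototype on Megatron-LM and evaluate on an eight-GPU testbed with real-world traces and three MoE models.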
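The scaling and placement steps above can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's algorithm: it assumes predicted per-expert token counts are already available, picks replica counts so that no replica exceeds a hypothetical per-replica load target, and places replicas with a simple greedy heuristic (heaviest replica first, onto the currently least-loaded GPU). It ignores function locality, which the actual Expert Placement strategy is said to optimize. All names and numbers are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def scale_experts(predicted_load: Dict[int, int],
                  target_per_replica: int) -> Dict[int, int]:
    """Choose a replica count per expert so no replica exceeds the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    return {e: max(1, math.ceil(load / target_per_replica))
            for e, load in predicted_load.items()}


def place_replicas(predicted_load: Dict[int, int],
                   replicas: Dict[int, int],
                   num_gpus: int) -> Tuple[List[List[int]], List[float]]:
    """Greedy placement: heaviest replica first, onto the least-loaded GPU."""
    items = []
    for expert, count in replicas.items():
        share = predicted_load[expert] / count  # load is split evenly
        items.extend((share, expert) for _ in range(count))
    items.sort(key=lambda item: (-item[0], item[1]))

    gpu_load = [0.0] * num_gpus
    placement: List[List[int]] = [[] for _ in range(num_gpus)]
    for share, expert in items:
        gpu = min(range(num_gpus), key=lambda i: gpu_load[i])
        gpu_load[gpu] += share
        placement[gpu].append(expert)
    return placement, gpu_load


# Hypothetical predicted token counts for one layer with four experts.
predicted = {0: 800, 1: 100, 2: 60, 3: 40}
replicas = scale_experts(predicted, target_per_replica=250)
placement, gpu_load = place_replicas(predicted, replicas, num_gpus=4)
print(replicas, placement, gpu_load)
```

With these made-up numbers, the overloaded expert 0 is replicated four times and the heaviest GPU carries 300 tokens, whereas a single copy of expert 0 would put 800 tokens on one GPU.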

Figure 1. Expert load imbalance across layers for different MoE models and datasets: (a) Mixtral-8×7B on ShareGPT and (b) Phi-3.5-MoE on LMSYS-Chat-1M.

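Experimental results

Research questions

RQ1: Can serverless computing be integrated with MoE serving to provide elasticity without sacrificing performance?
RQ2: Can lightweight, layer-aware predictors forecast upcoming expert loads accurately enough to prevent stragglers proactively?
RQ3: Which scaling and placement strategies optimize latency and cost under dynamic expert demand?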

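Key findings

MoEless reduces inference latency by 43% compared to state-of-the-art baselines.
MoEless reduces inference cost by 84% compared to state-of-the-art baselines.
The evaluation uses Mixtral-8×7B, Phi-3.5-MoE, and Llama-4-Scout on the LMSYS-Chat-1M and ShareGPT datasets.
Experiments run on an eight-GPU testbed with NVLink interconnects.
Layer-aware fine-tuning of the gate networks improves load prediction accuracy across all prediction distances.
The predictors run asynchronously, overlapped with computation, to avoid adding latency.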
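The last finding, overlapping prediction with computation, can be sketched as a control-flow pattern. This is only an illustration under assumptions: `predict_load` and `compute_layer` are hypothetical stand-ins, the prediction distance is fixed at one layer, and a Python thread is used purely to show the structure. How MoEless actually schedules its predictors is not described in this review.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def predict_load(layer: int, hidden: int) -> List[int]:
    """Hypothetical stand-in for a per-layer expert load predictor."""
    return [(layer + hidden + i) % 4 for i in range(4)]


def compute_layer(layer: int, hidden: int) -> int:
    """Hypothetical stand-in for one layer's forward computation."""
    return hidden + 1


def run_layers(num_layers: int, hidden: int) -> Tuple[int, Dict[int, List[int]]]:
    """Predict layer l+1's load in the background while layer l computes."""
    predictions: Dict[int, List[int]] = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        for layer in range(num_layers):
            future = None
            if layer + 1 < num_layers:
                # Start the next layer's prediction before computing this one.
                future = pool.submit(predict_load, layer + 1, hidden)
            hidden = compute_layer(layer, hidden)
            if future is not None:
                # Collect the prediction once this layer's compute is done.
                predictions[layer + 1] = future.result()
    return hidden, predictions


final_hidden, predictions = run_layers(num_layers=3, hidden=0)
print(final_hidden, predictions)
```

The point of the pattern is that the prediction for a later layer is requested before the current layer's computation starts, so its result is ideally ready by the time scaling and placement decisions for that layer are needed.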

Figure 2. Illustration of serving Mixture-of-Experts (MoE) based Large Language Models under expert parallelism, where tokens are routed by per-layer gate networks to a sparse set of experts distributed across GPUs. Expert load imbalance triggers inefficient resource provisioning (e.g., over-scal

This review was generated by AI and checked by a human editor.