QUICK REVIEW

[Paper Review] MoEless: Efficient MoE LLM Serving via Serverless Computing

Hanfei Yu, Bei Ouyang|arXiv (Cornell University)|Mar 6, 2026
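IoT and Edge/Fog Computing | Cited by 0
One-Sentence Summary

MoEless is the first serverless MoE serving framework; it lowers inference latency and cost by predicting expert load, scaling expert replicas, and optimizing their placement across GPUs.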

ABSTRACT

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs and advancing model scales, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployment using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, limiting expert scalability and elasticity, and resulting in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.
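The load imbalance described in the abstract can be illustrated with a toy simulation. The sketch below is not from the paper: `expert_loads`, `imbalance_ratio`, and the weighted random router are hypothetical stand-ins for a learned gate network, used only to show how skewed top-k routing concentrates tokens on a few experts.

```python
import random
from collections import Counter


def expert_loads(num_tokens, num_experts, top_k, weights, seed=0):
    """Count tokens routed to each expert by a toy top-k router.

    Assumption: a weighted random draw stands in for a learned gate network.
    """
    rng = random.Random(seed)
    experts = list(range(num_experts))
    loads = Counter()
    for _ in range(num_tokens):
        chosen = set()
        # Draw until the token has top_k distinct experts.
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=weights)[0])
        for expert in chosen:
            loads[expert] += 1
    return [loads[expert] for expert in experts]


def imbalance_ratio(loads):
    """Ratio of the busiest expert's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


if __name__ == "__main__":
    # One "hot" expert that the toy router prefers eight times as often.
    loads = expert_loads(1000, 8, 2, [8, 1, 1, 1, 1, 1, 1, 1])
    print(loads, round(imbalance_ratio(loads), 2))
```

Under expert parallelism, a ratio well above 1.0 means the GPU hosting the hot expert finishes last, which is the straggler effect the abstract refers to.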

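Motivation and Objectives

  • Mitigate expert load imbalance in MoE serving to reduce inference latency.
  • Enable elastic expert scaling for MoE models through serverless computing.
  • Minimize MoE inference cost by executing experts as serverless functions.
  • Develop prediction models and strategies for accurate load estimation, scaling, and placement.
  • Demonstrate performance gains on real MoE models and workloads.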

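Proposed Method

  • Decouple MoE experts from the model and package them as independent serverless functions, while keeping non-expert modules under data parallelism.
  • Design lightweight, layer-aware expert load predictors that estimate incoming expert load and identify stragglers at batch granularity.
  • Implement expert scaling that adjusts replica counts based on predicted load and latency targets.
  • Develop expert placement that optimizes GPU assignment and maximizes function locality and GPU utilization.
  • Run inference by distributing each layer's load evenly across replicas, balancing the workload and eliminating stragglers.
  • Prototype on top of Megatron-LM and evaluate on an eight-GPU testbed with real-world traces and three MoE models.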
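The scaling, even-split, and placement steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: `tokens_per_replica` is a hypothetical per-replica token budget standing in for a latency target, every expert keeps at least one replica, and placement uses a plain longest-processing-time greedy heuristic that balances GPU load but ignores function locality.

```python
import math


def scale_replicas(predicted_load, tokens_per_replica):
    """Pick a replica count per expert so no replica exceeds the token budget.

    Assumption: each expert keeps at least one replica, even when idle.
    """
    return [max(1, math.ceil(load / tokens_per_replica)) for load in predicted_load]


def split_evenly(load, replicas):
    """Split one expert's tokens across its replicas; shares differ by at most 1."""
    base, extra = divmod(load, replicas)
    return [base + (1 if i < extra else 0) for i in range(replicas)]


def place_greedy(predicted_load, replicas, num_gpus):
    """Assign replicas to GPUs, heaviest first, always onto the least-loaded GPU.

    Assumption: a generic greedy heuristic swapped in for illustration;
    it does not model function locality.
    """
    shares = []
    for expert, (load, count) in enumerate(zip(predicted_load, replicas)):
        for share in split_evenly(load, count):
            shares.append((share, expert))
    shares.sort(reverse=True)

    gpu_load = [0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for share, expert in shares:
        gpu = min(range(num_gpus), key=gpu_load.__getitem__)
        gpu_load[gpu] += share
        placement[gpu].append(expert)
    return placement, gpu_load


if __name__ == "__main__":
    predicted = [900, 100, 50, 30]  # predicted tokens per expert for one layer
    replicas = scale_replicas(predicted, tokens_per_replica=300)
    placement, gpu_load = place_greedy(predicted, replicas, num_gpus=4)
    print(replicas, placement, gpu_load)
```

In this example the overloaded expert is split into three replicas, so the busiest GPU processes 300 tokens rather than 900.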
Figure 1. Expert load imbalance across layers for different MoE models and datasets: (a) Mixtral-8×7B on ShareGPT and (b) Phi-3.5-MoE on LMSYS-Chat-1M.

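Experimental Results

Research Questions

  • RQ1: How can serverless computing be combined with MoE serving to provide elasticity without compromising performance?
  • RQ2: Can layer-aware, lightweight predictors accurately forecast incoming expert load so that stragglers are eliminated in advance?
  • RQ3: Under dynamic expert demand, which scaling and placement strategies optimize latency and cost?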

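Key Findings

  • MoEless reduces inference latency by 43% compared to state-of-the-art baselines.
  • MoEless reduces inference cost by 84% compared to state-of-the-art baselines.
  • The evaluation uses Mixtral-8×7B, Phi-3.5-MoE, and Llama-4-Scout on the LMSYS-Chat-1M and ShareGPT datasets.
  • Experiments were run on an eight-GPU testbed with NVLink interconnects.
  • Layer-wise fine-tuning of the gate networks improves load-prediction accuracy across prediction distances.
  • The predictor runs asynchronously and overlaps with computation, avoiding additional latency.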
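The last finding, prediction overlapped with computation, can be pictured with a small sketch. Everything here is hypothetical: `predict_load` and `run_layer` are placeholder functions, `distance` is an assumed look-ahead in layers, and a Python worker thread stands in for whatever asynchronous mechanism the real system uses (a Python thread does not give true parallelism for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def predict_load(layer, hidden):
    """Hypothetical stand-in predictor: returns a fake load for 4 experts."""
    return [(hidden + layer + expert) % 5 for expert in range(4)]


def run_layer(layer, hidden):
    """Hypothetical stand-in for one layer's forward computation."""
    return hidden + 1


def serve(num_layers, hidden=0, distance=1):
    """Run layers in order while predicting load `distance` layers ahead."""
    predictions = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        for layer in range(num_layers):
            target = layer + distance
            future = None
            if target < num_layers:
                # Start predicting for a later layer before computing this one.
                future = pool.submit(predict_load, target, hidden)
            hidden = run_layer(layer, hidden)  # overlaps with the prediction
            if future is not None:
                predictions[target] = future.result()
    return hidden, predictions


if __name__ == "__main__":
    final_hidden, predictions = serve(num_layers=4)
    print(final_hidden, predictions)
```

The point of the pattern is that the prediction for a later layer is requested before the current layer's computation starts, so its result is usually ready by the time scaling decisions for that layer are needed.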
Figure 2. Illustration of serving Mixture-of-Experts (MoE) based Large Language Models under expert parallelism, where tokens are routed by per-layer gate networks to a sparse set of experts distributed across GPUs. Expert load imbalance triggers inefficient resource provisioning (e.g., over-scal

This review was generated by AI and reviewed by a human editor.