QUICK REVIEW

[Paper Review] Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts

Xuan-Phi Nguyen, Shrey Pandit | arXiv (Cornell University) | 2026. 01. 23.

Mobile Crowdsensing and Crowdsourcing | Citations: 0

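One-Line Summary

This paper presents Least-Loaded Expert Parallelism (LLEP), a dynamic routing scheme that balances load in imbalanced MoE models by redistributing excess tokens and expert parameters from overloaded GPUs to underutilized ones, achieving substantial speedups and memory savings over standard Expert Parallelism (EP).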

ABSTRACT

Mixture-of-Experts (MoE) models are typically pre-trained with explicit load-balancing constraints to ensure statistically balanced expert routing. Despite this, we observe that even well-trained MoE models exhibit significantly imbalanced routing. This behavior is arguably natural, and even desirable, as imbalanced routing allows models to concentrate domain-specific knowledge within a subset of experts. Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices, but with a less-discussed assumption of balanced routing. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures on overloaded devices during post-training or inference, where explicit load balancing is often inapplicable. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones. This ensures that all devices complete their workloads within the minimum collective latency while respecting memory constraints. Across different model scales, LLEP achieves up to 5x speedup and 4x reduction in peak memory usage compared to standard EP. This enables faster and higher-throughput post-training and inference, with ~1.9x faster for gpt-oss-120b. We support our method with extensive theoretical analysis and comprehensive empirical evaluations, including ablation studies. These results illuminate key trade-offs and enable a principled framework for hardware-specific hyper-parameter tuning to achieve optimal performance.

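Motivation and Objectives

Identify the expert-routing imbalance that arises naturally in trained MoE models and its impact on EP efficiency.
Propose LLEP, which dynamically relocates excess tokens and expert weights to underloaded devices while respecting memory constraints.
Provide theoretical and empirical analysis of latency and memory under imbalance, and demonstrate practical benefits on real models.
Provide hardware-aware tuning guidance for maximizing throughput in post-training and inference scenarios.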

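Proposed Method

Problem definition: token-routing imbalance in MoE layers under EP during post-training or inference.
Propose LLEP, which uses the least-loaded assignment (LLA) algorithm to trigger spillover of excess tokens from overloaded GPUs to underloaded GPUs.
Develop a spill routine (LLAS) that transfers the remaining workload and the corresponding weights between GPUs.
Present the full LLEP dispatch-combine workflow, including backpropagation support and exact MoE computation.
Provide latency and peak-memory analysis to justify when and how spilling should occur, and introduce tunable hardware factors (α, m, λ).
Demonstrate speedups and memory reductions through end-to-end and controlled experiments across multiple MoE architectures.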
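The spillover idea can be illustrated with a minimal sketch. This is not the paper's LLA/LLAS implementation: it only plans token counts and ignores expert-weight transfers, memory limits, and communication cost. The function names, the contiguous expert-to-device placement, and the `capacity_factor` threshold are assumptions made for illustration and are not claimed to correspond to the paper's α, m, or λ.

```python
import heapq
import math
from collections import Counter


def device_loads_from_routing(expert_ids, num_experts, num_devices):
    """Count tokens per device, assuming experts are placed contiguously
    (hypothetical layout; num_experts must be divisible by num_devices)."""
    experts_per_device = num_experts // num_devices
    counts = Counter(e // experts_per_device for e in expert_ids)
    return [counts.get(d, 0) for d in range(num_devices)]


def least_loaded_spill(device_loads, capacity_factor=1.0):
    """Plan token transfers from overloaded devices to the least-loaded ones.

    A device is treated as overloaded when its load exceeds
    ceil(capacity_factor * mean load); capacity_factor is a hypothetical knob.
    Returns (transfers, final_loads), where each transfer is
    (src_device, dst_device, num_tokens).
    """
    if capacity_factor < 1.0:
        raise ValueError("capacity_factor must be >= 1.0 so all excess fits")
    n = len(device_loads)
    capacity = math.ceil(capacity_factor * sum(device_loads) / n)
    loads = list(device_loads)

    # Min-heap of (current load, device) for devices with spare capacity.
    heap = [(load, d) for d, load in enumerate(loads) if load < capacity]
    heapq.heapify(heap)

    transfers = []
    # Handle the most overloaded devices first.
    for src in sorted(range(n), key=lambda d: -loads[d]):
        excess = loads[src] - capacity
        while excess > 0:
            load, dst = heapq.heappop(heap)       # least-loaded device
            moved = min(excess, capacity - load)  # fill it up to capacity
            transfers.append((src, dst, moved))
            loads[src] -= moved
            loads[dst] += moved
            excess -= moved
            if loads[dst] < capacity:
                heapq.heappush(heap, (loads[dst], dst))
    return transfers, loads
```

For example, with per-device loads of `[900, 50, 30, 20]` tokens and the default threshold, the plan moves 650 tokens off device 0 and every device ends at 250 tokens.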

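Experimental Results

Research Questions

RQ1: How does imbalanced routing manifest in recent MoE models during pre-training, fine-tuning, or inference?
RQ2: Can a load-aware distributed routing policy reduce per-GPU latency and peak memory without changing MoE behavior?
RQ3: What are the theoretical and empirical cost dynamics of MoE with and without least-loaded routing under imbalance?
RQ4: How do the hyperparameters α, m, and λ affect LLEP performance across model scales and hardware configurations?
RQ5: Do end-to-end deployments (e.g., gpt-oss-20b/120b) benefit from LLEP in throughput and memory stability compared with standard EP?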

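Key Results

Even under extreme imbalance, LLEP achieves up to 5× speedup over standard EP while keeping memory usage stable.
LLEP's per-GPU peak memory stays nearly constant across imbalance scenarios, whereas standard EP's increases by up to 4×.
On real models, end-to-end throughput improvements reach up to 2.2× on gpt-oss-20b and 1.9× on gpt-oss-120b.
With realistic overheads, training with LLEP converges about 1.25× faster than with EP.
In ablations, larger batch sizes yield larger speedups and higher α reduces speedup, suggesting that balanced workloads are preferred at large scale.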
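One rough way to reason about where such speedups come from is a back-of-envelope model, sketched below. It assumes the MoE layer is synchronous, so its time tracks the most-loaded device, and that cost is proportional to token count. This is an illustrative simplification, not the paper's latency analysis, and `overhead_tokens` is a hypothetical stand-in for rebalancing cost expressed in token-equivalents.

```python
def estimated_speedup(device_loads, overhead_tokens=0.0):
    """Estimate the speedup of perfect rebalancing over plain EP.

    Assumes layer time is proportional to the most-loaded device's token
    count, and that rebalancing brings every device to the mean load plus
    a fixed overhead (hypothetical, in token-equivalents).
    """
    ep_cost = max(device_loads)
    balanced_cost = sum(device_loads) / len(device_loads) + overhead_tokens
    return ep_cost / balanced_cost


skewed = [900, 50, 30, 20]
print(estimated_speedup(skewed))                      # no overhead
print(estimated_speedup(skewed, overhead_tokens=50))  # with overhead
print(estimated_speedup([250, 250, 250, 250]))        # already balanced
```

Under this model the achievable gain grows with the ratio of maximum to mean load and shrinks as overhead grows, so rebalancing offers nothing when routing is already even.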

This review was generated by AI and reviewed by a human editor.