[Paper Review] Revisiting Parameter Server in LLM Post-Training
The paper introduces On-Demand Communication (ODC), a decentralized parameter-server–like scheme that replaces per-layer collectives in Fully Sharded Data Parallel (FSDP) with point-to-point transfers, improving device utilization and throughput for imbalanced LLM post-training workloads (SFT and RL) with up to 36% speedup.
Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. Our implementation of ODC and integration with FSDP is open-sourced at https://github.com/sail-sg/odc.
Motivation & Objective
- Motivate the need for robustness to workload imbalance in LLM post-training where sequence length variance causes synchronization barriers.
- Propose On-Demand Communication (ODC) to adapt PS concepts into FSDP without sacrificing memory efficiency.
- Demonstrate that ODC improves device utilization and training throughput across SFT and RL tasks.
- Provide practical guidance on load balancing and implementation to mitigate inter-node communication overhead.
Proposed method
- Replace per-layer all-gather and reduce-scatter in FSDP with direct point-to-point parameter fetches and gradient pushes.
- Decouple device progress by relaxing synchronization from layer to minibatch granularity while preserving synchronous optimization semantics.
- Render FSDP as a decentralized parameter server by colocating server and worker roles across all devices.
- Implement ODC using RDMA-based interfaces (CUDA IPC for intra-node; NVSHMEM for inter-node) and Triton-Distributed kernels for data transfers.
- Integrate ODC with FSDP by replacing collectives and collecting accumulated gradients at minibatch boundaries.
- Propose load-balancing strategies that shift packing decisions from microbatch to minibatch level to simplify and improve balance.
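To make the minibatch-level balancing idea concrete, here is a minimal sketch in Python. It is not the paper's LB-Mini or LB-Micro algorithm: it is a generic greedy heuristic (longest-processing-time first) that assigns the sequences of one minibatch to devices so that estimated per-device cost is roughly even. The function name `balance_minibatch` and the quadratic cost model are assumptions made only for illustration.

```python
import heapq


def balance_minibatch(seq_lens, num_devices, cost=lambda n: n * n):
    """Greedily assign sequences to devices, heaviest first.

    `cost` is a hypothetical per-sequence cost estimate; the quadratic
    default is an assumption, not a model taken from the paper.
    Returns one list of sequence lengths per device.
    """
    # Min-heap of (accumulated cost, device index).
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_devices)]

    for n in sorted(seq_lens, reverse=True):
        load, d = heapq.heappop(heap)  # currently least-loaded device
        assignment[d].append(n)
        heapq.heappush(heap, (load + cost(n), d))
    return assignment


seq_lens = [8, 1, 1, 4, 4, 2, 2, 7]
assignment = balance_minibatch(seq_lens, num_devices=2)
loads = [sum(n * n for n in dev) for dev in assignment]
```

Note that the devices may end up with different numbers of sequences. Because balancing happens over the whole minibatch, with a barrier only at the minibatch boundary, that is acceptable in this sketch; under per-layer collectives every device would have to run the same number of microbatches.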

Experimental results
Research questions
- RQ1: Can On-Demand Communication (ODC) reduce synchronization barriers and idle times caused by workload imbalance in LLM post-training?
- RQ2: Does integrating PS-like decoupled communication into FSDP preserve memory efficiency while improving throughput under imbalanced workloads?
- RQ3: What load-balancing strategies best complement ODC at minibatch granularity for long-context LLM training?
- RQ4: How does ODC perform on supervised fine-tuning and reinforcement learning tasks across model scales from 1.5B to 32B parameters?
- RQ5: What are the limitations of inter-node ODC communication and potential mitigations?
Key findings
- ODC consistently improves device utilization and end-to-end throughput across SFT and RL tasks.
- ODC achieves up to 36% speedup over standard FSDP in long-context SFT scenarios.
- Idle times due to workload imbalance can reach up to 50% in long-sequence supervised fine-tuning with traditional FSDP.
- Reframing FSDP as a decentralized PS with on-demand point-to-point transfers mitigates stragglers and relaxes synchronization from once per layer to once per minibatch.
- The LB-Mini and LB-Micro load-balancing variants enable effective minibatch-level balancing and often outpace the baseline in both RL and SFT settings.
- ODC remains competitive with collective methods within a single node but shows inter-node communication overhead, which can be mitigated by design choices such as hybrid sharding and overlapping communication with computation.
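The idle-time and straggler findings above can be illustrated with a toy timing model. The sketch below is an assumption-laden simplification, not the paper's cost model: it ignores communication time entirely and approximates per-layer collectives as lockstep progress per microbatch, so each step costs as much as the slowest device. With a single barrier per minibatch, each device instead runs its microbatches back to back. All function names and the `work` matrix are hypothetical.

```python
def collective_time(work):
    """work[d][m] is the compute time of microbatch m on device d.

    Lockstep model: every step waits for the slowest device.
    """
    num_steps = len(work[0])
    return sum(max(dev[m] for dev in work) for m in range(num_steps))


def on_demand_time(work):
    """One barrier per minibatch: the slowest device's total sets the time."""
    return max(sum(dev) for dev in work)


def idle_fraction(work, wall_time):
    """Share of total device-time spent waiting rather than computing."""
    busy = sum(sum(dev) for dev in work)
    return 1.0 - busy / (len(work) * wall_time)


# Three devices, three microbatches each; the long sequence lands on a
# different device at every step (made-up numbers).
work = [
    [8.0, 1.0, 2.0],
    [1.0, 8.0, 2.0],
    [2.0, 2.0, 8.0],
]

t_collective = collective_time(work)  # 8 + 8 + 8 = 24.0
t_on_demand = on_demand_time(work)    # max(11, 11, 12) = 12.0
idle_collective = idle_fraction(work, t_collective)
idle_on_demand = idle_fraction(work, t_on_demand)
```

In this made-up example the lockstep schedule leaves devices idle for more than half of the wall time, while the per-minibatch schedule leaves them idle for under a tenth. The numbers only illustrate the mechanism; they are not measurements from the paper.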

This review was created by AI and reviewed by human editors.