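[Paper Summary] A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
PicoSpec introduces a training-free, asynchronous edge-cloud speculative decoding framework that decouples edge-side drafting from cloud-side verification. It uses parallel drafting and separate rejection sampling to hide WAN latency, achieving up to a 2.9× speedup.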
Recent advancements and widespread adoption of Large Language Models (LLMs) in both industry and academia have catalyzed significant demand for LLM serving. However, traditional cloud services incur high costs, while on-device inference alone faces challenges due to limited resources. Edge-cloud collaboration emerges as a key research direction to combine the strengths of both paradigms, yet efficiently utilizing limited network bandwidth while fully leveraging and balancing the computational capabilities of edge devices and the cloud remains an open problem. To address these challenges, we propose the Pipelined Collaborative Speculative Decoding Framework (PicoSpec), a novel, general-purpose, and training-free speculative decoding framework for LLM edge-cloud collaborative inference. We design an asynchronous pipeline that resolves the mutual waiting problem inherent in vanilla speculative decoding within edge collaboration scenarios, which concurrently executes a Small Language Model (SLM) on the edge device and an LLM in the cloud. Meanwhile, to mitigate the significant communication latency caused by transmitting vocabulary distributions, we introduce separate rejection sampling with sparse compression, which completes the rejection sampling with only a one-time cost of transmitting the compressed vocabulary. Experimental results demonstrate that our solution outperforms baseline and existing methods, achieving up to a 2.9× speedup.
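Research Motivation and Goals
- Enable efficient LLM inference on resource-constrained edge devices through edge-cloud collaboration.
- Develop a training-free asynchronous pipeline that decouples edge drafting from cloud verification.
- Reduce communication overhead through a separate rejection sampling mechanism and sparse compression.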
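Proposed Method
- Proposes PicoSpec, with four edge-side modules: Parallel Drafter, Rejection Sampler, Speculative KV Cache, and Zero-Copy Communicator.
- Implements four cloud-side modules: Verifier, Request Handler, KV Cache, and Zero-Copy Communicator.
- Runs Parallel Drafting and Fast Verification concurrently so that edge drafting overlaps with cloud verification, minimizing pipeline bubbles.
- Uses Separate Rejection Sampling with Top-K sparse compression, transmitting only high-confidence candidates and reclaiming bandwidth without retraining.
- Provides a latency-aware rollback mechanism that keeps state consistent after a misprediction.
- Presents a probabilistic performance model for analyzing end-to-end throughput and deriving a latency-immunity property.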
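
The combination of Top-K sparse compression and rejection sampling listed above can be illustrated with a minimal sketch. It assumes the cloud sends only the K largest target-model probabilities per position and the edge rebuilds a renormalized distribution before applying the standard speculative-sampling acceptance test (accept with probability min(1, p/q), otherwise resample from the residual). The function names, the renormalization step, and the split of work between edge and cloud are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def top_k_compress(probs, k):
    """Cloud side (assumed): keep the k largest entries as (indices, values)."""
    idx = np.argpartition(probs, -k)[-k:]
    return idx, probs[idx]

def decompress(idx, vals, vocab_size):
    """Edge side (assumed): rebuild a dense, renormalized distribution."""
    dense = np.zeros(vocab_size)
    dense[idx] = vals
    return dense / dense.sum()

def speculative_accept(token, q_draft, p_target, rng):
    """Standard speculative-sampling test for one drafted token.

    Accept with probability min(1, p/q). On rejection, resample from the
    normalized residual max(0, p - q). Returns (accepted, emitted_token).
    """
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return True, token
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(residual), p=residual))

# Hypothetical 4-token vocabulary, for illustration only.
rng = np.random.default_rng(0)
p_full = np.array([0.5, 0.3, 0.15, 0.05])    # target (cloud) distribution
q = np.array([0.25, 0.25, 0.25, 0.25])       # draft (edge) distribution
idx, vals = top_k_compress(p_full, k=2)      # payload is O(K), not O(V)
p_sparse = decompress(idx, vals, vocab_size=4)
print(speculative_accept(0, q, p_sparse, rng))
```

Note that in this sketch the truncation to K entries slightly changes the target distribution, so the sampled output matches the truncated distribution rather than the full one. How PicoSpec handles this trade-off is not stated in the summary.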

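Experimental Results
Research Questions
- RQ1: How can edge and cloud components be decoupled in a high-latency WAN environment to achieve truly parallel speculative decoding?
- RQ2: In edge-cloud LLM inference, can a training-free asynchronous pipeline hide network latency while preserving model generality?
- RQ3: Can separate rejection sampling combined with sparse compression reduce uplink/downlink bandwidth without sacrificing accuracy?
- RQ4: What theoretical and empirical throughput gains does PicoSpec deliver across different draft lengths and acceptance rates?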
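Key Findings
- PicoSpec achieves up to a 2.9× speedup over the baseline in high-latency edge-cloud scenarios.
- The asynchronous pipeline (Parallel Drafting) overlaps drafting with cloud verification, eliminating edge idle time so that throughput is bounded by the edge drafting rate rather than by RTT.
- Fast Verification further reduces pipeline bubbles by starting cloud-side preparation before the complete draft arrives.
- Separate Rejection Sampling with Top-K sparse compression reduces downlink data from O(V) to O(K) per round, significantly lowering communication overhead.
- Ablation studies show that the asynchronous pipeline, Fast Verification, and Split-Rej all matter, with the largest throughput drop occurring when Para-draft is removed.
- Draft-length tuning: throughput peaks at n=4 and is robust to the choice of n across the practical range.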
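
To make the interplay between draft length, acceptance rate, and RTT concrete, here is a toy throughput model. It is not the paper's probabilistic model, which this summary does not reproduce. It assumes each drafted token is accepted independently with probability `alpha`, which gives the well-known expected count (1 - alpha^(n+1)) / (1 - alpha) of tokens per verification round, and it idealizes the pipeline as fully overlapping the next round's drafting with the network round trip and verification while ignoring rollback cost. All parameter names and the timing values are hypothetical.

```python
def expected_tokens(alpha, n):
    """Expected tokens emitted per round with n drafted tokens,
    assuming i.i.d. per-token acceptance probability alpha."""
    if alpha == 1.0:
        return n + 1
    return (1 - alpha ** (n + 1)) / (1 - alpha)

def sync_throughput(alpha, n, t_draft, t_verify, rtt):
    """Vanilla speculative decoding: draft, send, verify, wait, repeat."""
    round_time = n * t_draft + rtt + t_verify
    return expected_tokens(alpha, n) / round_time

def pipelined_throughput(alpha, n, t_draft, t_verify, rtt):
    """Idealized overlap (assumption): a round takes as long as the slower
    of drafting and (round trip + verification). Rollback cost is ignored."""
    round_time = max(n * t_draft, rtt + t_verify)
    return expected_tokens(alpha, n) / round_time

# Hypothetical timings in seconds, for illustration only.
alpha, t_draft, t_verify, rtt = 0.8, 0.02, 0.05, 0.10
for n in (2, 4, 8):
    s = sync_throughput(alpha, n, t_draft, t_verify, rtt)
    p = pipelined_throughput(alpha, n, t_draft, t_verify, rtt)
    print(f"n={n}: sync={s:.1f} tok/s, pipelined={p:.1f} tok/s, ratio={p / s:.2f}")
```

Under these assumptions, once `n * t_draft` exceeds `rtt + t_verify` the pipelined round time no longer depends on RTT, which mirrors the finding that throughput becomes bounded by the edge drafting rate.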

This summary was generated by AI and reviewed by a human editor.