QUICK REVIEW

[Paper Review] A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference

Yida Zhang, Zhiyong Gao|arXiv (Cornell University)|Mar 19, 2026

Big Data and Digital Economy | Citations: 0

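One-Sentence Summary

PicoSpec introduces a training-free edge-cloud speculative decoding framework that decouples edge-side drafting from cloud-side verification to mask WAN latency, and uses parallel drafting and separate rejection sampling to achieve up to a 2.9× speedup.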

ABSTRACT

Recent advancements and widespread adoption of Large Language Models (LLMs) in both industry and academia have catalyzed significant demand for LLM serving. However, traditional cloud services incur high costs, while on-device inference alone faces challenges due to limited resources. Edge-cloud collaboration emerges as a key research direction to combine the strengths of both paradigms, yet efficiently utilizing limited network bandwidth while fully leveraging and balancing the computational capabilities of edge devices and the cloud remains an open problem. To address these challenges, we propose Pipelined Collaborative Speculative Decoding Framework (PicoSpec), a novel, general-purpose, and training-free speculative decoding framework for LLM edge-cloud collaborative inference. We design an asynchronous pipeline that resolves the mutual waiting problem inherent in vanilla speculative decoding within edge collaboration scenarios, which concurrently executes a Small Language Model (SLM) on the edge device and an LLM in the cloud. Meanwhile, to mitigate the significant communication latency caused by transmitting vocabulary distributions, we introduce separate rejection sampling with sparse compression, which completes the rejection sampling with only a one-time cost of transmitting the compressed vocabulary. Experimental results demonstrate that our solution outperforms baseline and existing methods, achieving up to 2.9× speedup.

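Motivation and Objectives

Motivate efficient LLM inference on resource-constrained devices through edge-cloud collaboration.
Develop a training-free asynchronous pipeline that decouples edge-side drafting from cloud-side verification.
Reduce communication overhead with a separate rejection sampling mechanism and sparse compression.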

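Proposed Method

Proposes four edge modules: Parallel Drafter, Rejection Sampler, Speculative KV Cache, and Zero-Copy Communicator.
Implements four cloud modules: Verifier, Request Handler, KV Cache, and Zero-Copy Communicator.
Enables Parallel Drafting and Fast Verification so that edge drafting overlaps with cloud verification, minimizing pipeline bubbles.
Uses Separate Rejection Sampling with Top-K sparse compression to transmit only high-probability candidates, saving bandwidth without retraining.
Provides a latency-aware rollback mechanism that keeps state consistent after mispredictions.
Provides a probabilistic performance model for analyzing end-to-end throughput and deriving latency-tolerance properties.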
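The review does not spell out how Separate Rejection Sampling divides work between edge and cloud, so the following is only a minimal sketch of the two generic ingredients it names: Top-K sparse compression of a vocabulary distribution, and the standard speculative-decoding acceptance test (accept a draft token with probability min(1, p/q)). The function names and the choice to treat tokens outside the Top-K set as having target probability 0 are assumptions of this sketch, not PicoSpec's actual design; that truncation makes the result approximate.

```python
import random


def top_k_compress(probs, k):
    """Keep the k largest entries of a dense distribution as {token_id: prob}.

    Sending this dict instead of the full list is what shrinks the payload
    from O(V) values to O(K) values.
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top}


def accept_draft_token(token, q_draft, p_target_sparse, rng=random):
    """Standard acceptance test: accept with probability min(1, p / q).

    q_draft is the dense draft distribution held locally; the token was
    sampled from it, so q_draft[token] > 0. Tokens missing from the sparse
    target are treated as p = 0 (an assumption of this sketch).
    """
    p = p_target_sparse.get(token, 0.0)
    q = q_draft[token]
    return rng.random() < min(1.0, p / q)


def residual_distribution(q_draft, p_target_sparse):
    """Renormalised max(0, p - q) over the Top-K support.

    After a rejection, the replacement token is sampled from this
    distribution. If every residual is zero, fall back to the renormalised
    sparse target.
    """
    resid = {t: max(0.0, p - q_draft[t]) for t, p in p_target_sparse.items()}
    total = sum(resid.values())
    if total == 0.0:
        total = sum(p_target_sparse.values())
        return {t: p / total for t, p in p_target_sparse.items()}
    return {t: r / total for t, r in resid.items()}
```

In this sketch only the compressed dict would need to cross the network, while the dense draft distribution stays on the side that produced it.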

Figure 1: Comparison between (a) Cloud autoregressive decoding, (b) Cloud speculative decoding, (c) Vanilla collaborative speculative decoding, and (d) PicoSpec.

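Experimental Results

Research Questions

RQ1: How can the edge and cloud components be decoupled to achieve truly parallel inference in high-latency WAN environments?
RQ2: Can a training-free asynchronous pipeline mask network latency while preserving model generality?
RQ3: Can the separate rejection sampling scheme and sparse compression reduce uplink/downlink bandwidth without sacrificing accuracy?
RQ4: How large are PicoSpec's theoretical and empirical throughput gains across different draft lengths and acceptance rates?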

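Key Findings

PicoSpec achieves up to a 2.9× speedup over the baseline in high-latency edge-cloud environments.
The asynchronous pipeline (Parallel Drafting) reduces edge idle time and, by overlapping drafting with cloud verification, makes throughput bounded by edge drafting speed rather than RTT.
Fast Verification further reduces pipeline bubbles by letting the cloud side begin preparation before the full draft arrives.
Separate Rejection Sampling with Top-K sparse compression reduces downlink data from O(V) to O(K), substantially lowering communication overhead.
Ablation studies show that the asynchronous pipeline, Fast Verification, and Split-Rej each matter, with removing Para-draft causing the largest throughput drop.
Tuning the draft length n shows peak throughput at n=4, with robust performance across the practical range of n.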
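To make the interplay between draft length, acceptance rate, and RTT concrete, here is a toy throughput model. It is not the paper's probabilistic performance model: it assumes each draft token is accepted independently with probability alpha (the usual textbook simplification), and the pipelined round time is an idealised bound that ignores drafts discarded after a rejection. All function names and parameters are hypothetical.

```python
def expected_tokens_per_round(alpha, n):
    """Expected tokens emitted per verification round with n drafted tokens.

    Assumes independent acceptance with probability alpha. Each round also
    yields one token from the verifier (a correction or a bonus token), which
    gives the closed form (1 - alpha**(n + 1)) / (1 - alpha).
    """
    if alpha == 1.0:
        return n + 1.0
    return (1.0 - alpha ** (n + 1)) / (1.0 - alpha)


def vanilla_round_time(n, t_draft, t_verify, rtt):
    """Serialized round: draft n tokens, pay one round trip, then verify."""
    return n * t_draft + rtt + t_verify


def pipelined_round_time(n, t_draft, t_verify):
    """Idealised steady state when drafting overlaps the round trip and
    verification, so the slower of the two stages sets the pace."""
    return max(n * t_draft, t_verify)


def throughput(alpha, n, round_time):
    """Expected tokens per second for a given round time in seconds."""
    return expected_tokens_per_round(alpha, n) / round_time


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    alpha, t_draft, t_verify, rtt = 0.8, 0.010, 0.030, 0.100
    for n in (1, 2, 4, 8):
        vanilla = throughput(alpha, n, vanilla_round_time(n, t_draft, t_verify, rtt))
        piped = throughput(alpha, n, pipelined_round_time(n, t_draft, t_verify))
        print(f"n={n}: vanilla={vanilla:.1f} tok/s, pipelined={piped:.1f} tok/s")
```

Under these assumptions RTT appears only in the serialized round time, which is the sense in which a pipelined design is limited by drafting speed instead of network latency. Because expected tokens per round saturate as n grows while drafting time keeps increasing, a finite best draft length is plausible, though this toy model makes no claim about where that optimum lies.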

Figure 2: System Overview of PicoSpec Framework.


This review was generated by AI and checked by a human editor.