QUICK REVIEW

[Paper Review] MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang, Arun Verma|arXiv (Cornell University)|Feb 24, 2026

Natural Language Processing Techniques | Citations: 0

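One-Line Summary

MineDraft introduces batch parallel speculative decoding (PSD), which uses two concurrent batches to overlap drafting with verification, and reports up to 75% higher throughput and up to 39% lower end-to-end latency than standard speculative decoding.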

ABSTRACT

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

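Motivation and Objectives

Speed up LLM inference by hiding the drafting latency that standard speculative decoding incurs.
Propose a batch parallel PSD framework that overlaps drafting with verification.
Analyze theoretically, under realistic assumptions, how much more efficient PSD is than standard SD.
Implement MineDraft as a vLLM plugin and evaluate its practicality across multiple models and datasets.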

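Proposed Method

Introduces a novel batch-parallel design that maintains two batches of requests and alternates them between drafting and verification, so that drafting for one batch overlaps with verification for the other.
Runs the draft model on a separate GPU and transfers tokens to the target model through direct GPU-to-GPU communication.
Provides a theoretical analysis showing that, under mild assumptions, PSD reduces end-to-end latency by at least 37%.
Shows that MineDraft delivers up to 75% higher average throughput and up to 39% lower end-to-end latency than standard SD.
Integrates MineDraft as a plugin for vLLM, a production-oriented inference library, with support for continuous batching and PagedAttention.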

Figure 1: MineDraft parallelizes drafting and verification: a draft model generates tokens while the target model simultaneously verifies the previously generated draft tokens, thereby hiding drafting latency and improving overall inference throughput.
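
The overlap that Figure 1 describes can be illustrated with a toy timing model. The sketch below is not MineDraft's scheduler and not the paper's theoretical analysis. The function names, the fixed per-round costs t_draft and t_verify, and the assumption that the drafter and verifier are two independent resources (for example, separate GPUs) are all hypothetical simplifications. It compares a strictly sequential schedule with a two-batch schedule in which one batch drafts while the other is verified.

```python
def sequential_makespan(rounds: int, t_draft: float, t_verify: float,
                        batches: int = 2) -> float:
    """Total time when every draft and verify stage runs one after another."""
    return batches * rounds * (t_draft + t_verify)


def overlapped_makespan(rounds: int, t_draft: float, t_verify: float,
                        batches: int = 2) -> float:
    """Total time when drafting and verification use separate resources.

    Assumptions of this toy model: each resource serves one batch at a time,
    and a batch may start drafting its next round only after its previous
    round has been verified.
    """
    drafter_free = 0.0                 # time at which the drafter is idle again
    verifier_free = 0.0                # time at which the verifier is idle again
    batch_ready = [0.0] * batches      # time at which each batch may draft again
    for _ in range(rounds):
        for b in range(batches):
            draft_end = max(drafter_free, batch_ready[b]) + t_draft
            drafter_free = draft_end
            verify_end = max(verifier_free, draft_end) + t_verify
            verifier_free = verify_end
            batch_ready[b] = verify_end
    return verifier_free


# Made-up costs: drafting takes 1 time unit, verification takes 3.
seq = sequential_makespan(rounds=2, t_draft=1.0, t_verify=3.0)   # 16.0
ovl = overlapped_makespan(rounds=2, t_draft=1.0, t_verify=3.0)   # 13.0
print(f"sequential={seq}, overlapped={ovl}, saved={1 - ovl / seq:.1%}")
```

With these made-up costs, verification dominates, so the overlapped schedule hides every drafting stage except the first. The numbers are illustrative only and are not meant to reproduce the 37% bound or the measured results.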

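Experimental Results

Research Questions

RQ1 How can drafting and verification be overlapped to hide drafting latency?
RQ2 Under realistic draft-quality curves, how large is the theoretical latency benefit of batch PSD compared with standard SD?
RQ3 How does the two-batch MineDraft design perform across multiple models and drafting strategies in settings close to production use?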

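Key Findings

PSD can reduce end-to-end latency by at least 37% when the assumed conditions on drafting and verification dynamics hold.
MineDraft achieves up to 75% higher average throughput than standard SD across model configurations and datasets.
It reports up to 65.02% higher throughput than the best baseline method.
Placing the draft model on a separate GPU alleviates memory contention and enables concurrent drafting.
Integrating MineDraft with existing drafting strategies such as EAGLE and TETRIS yields additional performance gains.
The implemented vLLM plugin demonstrates practical deployability and compatibility with PagedAttention.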
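
Throughput gains and latency reductions are percentages of different quantities, so they are easy to misread. As a small arithmetic aid, the sketch below converts the reported figures into multiplicative factors. The helper names are hypothetical and do not come from the paper.

```python
def throughput_factor(gain_pct: float) -> float:
    """A throughput gain of g% means (1 + g/100) times the baseline throughput."""
    return 1.0 + gain_pct / 100.0


def latency_speedup(reduction_pct: float) -> float:
    """A latency reduction of r% leaves (1 - r/100) of the baseline time,
    which corresponds to a speedup of 1 / (1 - r/100)."""
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


print(f"{throughput_factor(75):.2f}x")   # 1.75x
print(f"{latency_speedup(39):.2f}x")     # 1.64x
print(f"{latency_speedup(37):.2f}x")     # 1.59x
```

These are unit conversions only. The throughput and latency figures are separate "up to" results and need not come from the same configuration.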

Figure 2: Architecture overview of MineDraft. (Left) The Scheduler manages request life-cycles and batch IDs by coordinating with the Batch Manager, which maintains two batches to enable parallelism in MineDraft. (Right) Parallel execution timeline of the Drafter and Verifier across speculative d

This review was generated by AI and checked by a human editor.