QUICK REVIEW

[Paper Review] MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang, Arun Verma | arXiv (Cornell University) | Feb 24, 2026

Natural Language Processing Techniques | Cited by: 0

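One-Sentence Summary

MineDraft introduces batch-parallel PSD, which overlaps drafting and verification by running two batches concurrently, significantly improving throughput and reducing latency.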

ABSTRACT

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

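Research Motivation and Goals

Speed up large language model inference by reducing the drafting latency in speculative decoding.
Propose a batch-parallel PSD framework that overlaps drafting with verification.
Analyze theoretically, under realistic assumptions, the efficiency gain of PSD over standard SD.
Demonstrate practicality by implementing MineDraft as a vLLM plugin and evaluating it across multiple models and datasets.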

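Proposed Method

Proposes a novel batch-parallel design that maintains two batches of requests and alternates drafting and verification between them.
Runs the draft model on a separate GPU and transfers tokens to the target model through direct GPU-to-GPU communication.
Provides a theoretical analysis showing that, under mild assumptions, PSD reduces end-to-end latency by at least 37%.
Shows that MineDraft achieves up to 75% higher average throughput and up to 39% lower end-to-end latency than standard SD.
Integrates MineDraft as a plugin for vLLM, a production-ready inference library, with support for continuous batching and PagedAttention.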
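
The two-batch alternation described above can be illustrated with a toy timing model. The Python sketch below is not MineDraft's implementation and not the paper's analysis. It assumes fixed per-round stage times (`t_draft` and `t_verify`, both hypothetical parameters), ignores acceptance rates, GPU-to-GPU transfer cost, and batch-size effects, and only compares a strictly sequential schedule with one in which a drafter and a verifier work on two batches in alternation.

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    total: float                                 # time at which the last stage ends
    events: list = field(default_factory=list)   # (start_time, stage, batch_id)


def sequential_sd(num_rounds, t_draft, t_verify, num_batches=2):
    """Toy standard SD: each batch drafts, then verifies, with no overlap."""
    clock = 0.0
    events = []
    for _ in range(num_rounds):
        for batch in range(num_batches):
            events.append((clock, "draft", batch))
            clock += t_draft
            events.append((clock, "verify", batch))
            clock += t_verify
    return Timeline(clock, events)


def two_batch_psd(num_rounds, t_draft, t_verify):
    """Toy two-batch schedule: drafting one batch overlaps verifying the other."""
    drafter_free = 0.0         # time at which the drafter is next idle
    verifier_free = 0.0        # time at which the verifier is next idle
    verified_at = [0.0, 0.0]   # end time of each batch's latest verification
    events = []
    for _ in range(num_rounds):
        for batch in (0, 1):
            # A batch can only be drafted again once its previous round is verified.
            start_draft = max(drafter_free, verified_at[batch])
            drafter_free = start_draft + t_draft
            events.append((start_draft, "draft", batch))
            # Verification waits for this draft and for the verifier to be idle.
            start_verify = max(verifier_free, drafter_free)
            verifier_free = start_verify + t_verify
            verified_at[batch] = verifier_free
            events.append((start_verify, "verify", batch))
    return Timeline(verifier_free, events)


if __name__ == "__main__":
    seq = sequential_sd(num_rounds=2, t_draft=1.0, t_verify=1.0)
    psd = two_batch_psd(num_rounds=2, t_draft=1.0, t_verify=1.0)
    print(seq.total, psd.total)
```

With equal stage times of 1.0 and two rounds per batch, this toy model gives 8.0 time units for the sequential schedule and 5.0 for the overlapped one. These numbers only illustrate the overlap mechanism and are not results from the paper.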

Figure 1: MineDraft parallelizes drafting and verification: a draft model generates tokens while the target model simultaneously verifies the previously generated draft tokens, thereby hiding drafting latency and improving overall inference throughput.

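Experimental Results

Research Questions

RQ1: How can drafting latency in speculative decoding be hidden by overlapping drafting with verification?
RQ2: Under realistic draft quality curves, what is the theoretical latency benefit of batch PSD over standard SD?
RQ3: How does the two-batch MineDraft design perform across different models and drafting strategies in a production setting?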

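Key Findings

Under specific conditions on the drafting/verification dynamics, PSD reduces end-to-end latency by at least 37%.
Across model settings and datasets, MineDraft improves average throughput over standard SD by up to 75%.
Over the best baseline method, the throughput improvement reaches up to 65.02%.
By placing the draft model on a separate GPU, MineDraft alleviates memory contention and enables parallel drafting.
Integrating MineDraft with existing drafting strategies such as EAGLE or TETRIS yields additional performance gains.
The vLLM plugin implementation demonstrates the feasibility of real-world deployment and is compatible with PagedAttention.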
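
The findings above mix latency reductions and throughput gains, which are different units. The short Python helper below is a plain unit conversion, assuming a fixed workload in which time per request shrinks by the stated fraction; the function names are illustrative and not taken from the paper. Because the reported figures are "up to" values that may come from different settings, the converted numbers need not match each other.

```python
def speedup_from_latency_reduction(reduction):
    """Convert a fractional latency reduction (e.g. 0.37) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup):
    """Convert a speedup factor (e.g. 1.75) into a fractional latency reduction."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # A 37% latency reduction corresponds to roughly a 1.59x speedup.
    print(round(speedup_from_latency_reduction(0.37), 2))
    # A 1.75x rate (75% higher) corresponds to roughly 43% less time per unit of work.
    print(round(latency_reduction_from_speedup(1.75), 2))
```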

Figure 2: Architecture overview of MineDraft. (Left) The Scheduler manages request life-cycles and batch IDs by coordinating with the Batch Manager, which maintains two batches to enable parallelism in MineDraft. (Right) Parallel execution timeline of the Drafter and Verifier across speculative d

This review was generated by AI and reviewed by human editors.