QUICK REVIEW

[Paper Review] Duration Aware Scheduling for ASR Serving Under Workload Drift

Darshan Makwana, Yash Jogi|arXiv (Cornell University)|Mar 11, 2026

Speech Recognition and Synthesis | Cited by 0

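One-Sentence Summary

This paper shows that audio duration can serve as a proxy for ASR processing time and integrates Shortest Job First (SJF) and Highest Response Ratio Next (HRRN) into vLLM to reduce end-to-end latency; SJF substantially improves median latency, while HRRN bounds tail-latency degradation under workload drift.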

ABSTRACT

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet, widely used serving engines rely on first-come-first-served (FCFS) scheduling, which ignores variability in request duration and leads to head-of-line blocking under workload drift. We show that audio duration is an accurate proxy for job processing time in ASR models such as Whisper, and use this insight to enable duration-aware scheduling. We integrate two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into vLLM and evaluate them under realistic and drifted workloads. On LibriSpeech test-clean, compared to baseline, SJF reduces median E2E latency by up to $73\%$ at high load, but increases $90$th-percentile tail latency by up to $97\%$ due to starvation of long requests. HRRN addresses this trade-off: it reduces median E2E latency by up to $28\%$ while bounding tail-latency degradation to at most $24\%$. These gains persist under workload drift, with no throughput penalty and $<0.1$\,ms scheduling overhead per request.

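Motivation and Objectives

Motivate reducing end-to-end latency in ASR serving pipelines under variable workloads.
Demonstrate that audio duration correlates with processing time for Whisper-style ASR models and can therefore guide scheduling.
Evaluate duration-aware scheduling against FCFS under realistic and drifted workloads.
Provide SJF and HRRN implementations that can be deployed in production ASR serving engines.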

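Proposed Method

Empirically establish a linear correlation between audio duration and output token count, which provides the basis for estimating job length.
Implement two classical scheduling policies, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), using audio duration as the estimate of job length.
Integrate these schedulers into the vLLM engine and evaluate them with Whisper large-v3 on LibriSpeech test-clean and on a synthetic uniform-duration workload.
Simulate workload drift with Poisson arrivals whose rate shifts over time, and measure end-to-end latency (E2EL) and time to first token (TTFT) at the P50 and P90 percentiles.
Keep scheduling overhead minimal (<0.1 ms per request) and measure throughput to confirm there is no penalty.
Discuss limitations (silence sensitivity, adaptive κ, dynamic policy switching) and propose practical mitigations.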
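The two policies listed above can be sketched as priority functions over a waiting queue. This is a minimal illustration of the classical textbook forms, assuming audio duration (in seconds) stands in for the job length; the `Request` record, its field names, and `pick_next` are hypothetical and are not vLLM's API or the authors' implementation. Durations are assumed to be strictly positive.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    req_id: str
    arrival_time: float    # seconds, when the request entered the queue
    audio_duration: float  # seconds of audio, used as the job-length estimate


def sjf_key(req: Request, now: float) -> float:
    # Shortest Job First: the smallest estimated job length is served first.
    return req.audio_duration


def hrrn_key(req: Request, now: float) -> float:
    # Highest Response Ratio Next: ratio = (wait + service) / service.
    # Negated so that min() selects the highest ratio.
    wait = max(0.0, now - req.arrival_time)
    return -(wait + req.audio_duration) / req.audio_duration


def pick_next(queue, now, key):
    """Return the waiting request with the lowest key; ties go to the earliest arrival."""
    return min(queue, key=lambda r: (key(r, now), r.arrival_time))


queue = [Request("long", 0.0, 8.0), Request("short", 9.0, 2.0)]
print(pick_next(queue, 10.0, sjf_key).req_id)
print(pick_next(queue, 10.0, hrrn_key).req_id)
```

At t = 10 s the long request has waited 10 s (ratio 18/8 = 2.25) and the short one 1 s (ratio 3/2 = 1.5), so HRRN serves the long request while SJF still prefers the short one. This aging effect is what keeps long requests from starving.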

Figure 1: Toy example illustrating head-of-line blocking under FCFS and the benefit of duration-aware scheduling. Three requests arrive in order $R_{1},R_{2},R_{3}$ with audio durations $8$ s, $4$ s, and $2$ s. We assume a constant encoder cost of $1$ s per request and a decoding rate of $5$ output

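Experimental Results

Research Questions

RQ1: Can audio duration reliably predict processing time for encoder-decoder models such as Whisper?
RQ2: Under high load or workload drift, can duration-based scheduling (SJF and HRRN) reduce median end-to-end latency without significantly increasing tail latency?
RQ3: In ASR serving, how do SJF and HRRN compare with FCFS in throughput and scheduling overhead?
RQ4: Do the benefits of duration-aware scheduling generalize across model sizes and duration distributions?
RQ5: What practical considerations and limitations apply when deploying duration-aware scheduling in production?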

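Key Findings

On LibriSpeech test-clean at high load, SJF reduces median end-to-end latency by up to 73%, but increases 90th-percentile tail latency by up to 97% because long requests are starved.
On the LibriSpeech workload, HRRN reduces median end-to-end latency by up to 28% while bounding the degradation in 90th-percentile tail latency to at most 24%.
Both policies add less than 0.1 ms of scheduling overhead per request, and throughput matches FCFS under the tested conditions.
The gains persist under workload drift, with Whisper large-v3 on LibriSpeech and on a synthetic uniform-duration workload, indicating that they come from reordering rather than merely from exploiting duration skew.
TTFT (time to first token) shows an even larger median improvement under SJF, dropping by up to 93% on LibriSpeech at high load.
Under bursty workloads (infinite arrival rate), HRRN provides the most consistent improvement across latency metrics, with a smaller tail penalty than SJF.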
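To make the median-versus-tail trade-off concrete, the sketch below simulates a toy non-preemptive, single-server queue under FCFS, SJF, and HRRN and reports P50 and P90 end-to-end latency. Everything here is an assumption for illustration: the single-server model ignores vLLM's batching, the service times and the arrival rate are made up, and `simulate` and `percentile` are hypothetical helpers, so the numbers it prints will not match the paper's measurements.

```python
import math
import random


def simulate(policy, arrivals, service):
    """Non-preemptive single-server queue; returns per-request E2E latency."""
    n = len(arrivals)
    pending = []              # indices of requests that have arrived and wait
    latencies = [0.0] * n
    now, nxt, done = 0.0, 0, 0
    while done < n:
        # Admit every request that has arrived by the current time.
        while nxt < n and arrivals[nxt] <= now:
            pending.append(nxt)
            nxt += 1
        if not pending:
            now = arrivals[nxt]   # server idles until the next arrival
            continue
        if policy == "fcfs":
            i = min(pending, key=lambda j: arrivals[j])
        elif policy == "sjf":
            i = min(pending, key=lambda j: (service[j], arrivals[j]))
        elif policy == "hrrn":
            i = max(pending, key=lambda j: (now - arrivals[j] + service[j]) / service[j])
        else:
            raise ValueError(policy)
        pending.remove(i)
        now += service[i]         # run the chosen request to completion
        latencies[i] = now - arrivals[i]
        done += 1
    return latencies


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical workload: Poisson arrivals and uniform service times (seconds).
rng = random.Random(0)
arrivals, t = [], 0.0
for _ in range(2000):
    t += rng.expovariate(0.55)    # about 0.55 requests per second
    arrivals.append(t)
service = [rng.uniform(0.2, 3.0) for _ in arrivals]

for policy in ("fcfs", "sjf", "hrrn"):
    lat = simulate(policy, arrivals, service)
    print(policy, round(percentile(lat, 50), 2), round(percentile(lat, 90), 2))
```

Varying the arrival rate over the run, for example by switching the `expovariate` parameter partway through, would be one simple way to mimic drift in this toy setting.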

Figure 2: Scatter plots showing the relationship between audio duration and ASR output token count. (a) On the LibriSpeech English test set, token count increases linearly with audio duration, indicating a strong correlation. (b) On the FLEURS test sets for Spanish, Hindi, and Arabic, the linear dur
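A linear relationship of the kind Figure 2 describes can be checked with an ordinary least-squares fit and a Pearson correlation. The sketch below is only an illustration: the sample durations and token counts are invented, not taken from LibriSpeech or FLEURS, and `fit_line` and `pearson_r` are hypothetical helper names.

```python
import math


def fit_line(durations, token_counts):
    """Ordinary least squares for tokens ~ slope * duration + intercept."""
    n = len(durations)
    mean_x = sum(durations) / n
    mean_y = sum(token_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in durations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(durations, token_counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up (duration in seconds, output token count) pairs for illustration.
durations = [2.0, 5.0, 8.0, 12.0, 20.0]
tokens = [9.0, 21.0, 30.0, 47.0, 78.0]

slope, intercept = fit_line(durations, tokens)
print("tokens per second of audio:", round(slope, 2))
print("intercept:", round(intercept, 2))
print("Pearson r:", round(pearson_r(durations, tokens), 3))
```

A correlation close to 1 on real traffic would support using duration directly as the scheduler's job-length estimate; a weaker fit, for example on audio with long silences, would suggest the estimate needs correction.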


This review was generated by AI and reviewed by human editors.