QUICK REVIEW

[Paper Review] Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Zangwei Zheng, Xiaozhe Ren|arXiv (Cornell University)|May 22, 2023

Topic Modeling | Cited by 9

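One-Sentence Summary

The paper proposes an LLM-powered inference pipeline that predicts each response's length in advance and groups queries with similar predicted lengths into micro-batches, improving throughput on Vicuna-7B by up to 86% without sacrificing quality.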

ABSTRACT

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs. Our approach begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead. By leveraging this information, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches. We evaluate our approach on real-world instruction datasets using the LLaMA-based model, and our results demonstrate an impressive 86% improvement in inference throughput without compromising effectiveness. Notably, our method is orthogonal to other inference acceleration techniques, making it a valuable addition to many existing toolkits (e.g., FlashAttention, Quantization) for LLM inference.

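Research Motivation and Objectives

Investigate whether LLMs can perceive the length of the response they are about to generate (response length perception).
Use length perception to design a sequence scheduling system that groups queries by predicted length.
Reduce redundant computation in autoregressive LLM inference and raise throughput without degrading performance.
Propose mechanisms (failure collection and recomputation, variable batch size) that improve robustness and efficiency.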
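The redundant computation mentioned above can be made concrete with a little arithmetic. The sketch below assumes that a vanilla batch keeps decoding until its longest response finishes, so every shorter response occupies idle token slots. The function name `wasted_token_fraction` and the example lengths are illustrative, not taken from the paper.

```python
def wasted_token_fraction(response_lengths):
    """Fraction of decoding slots that are redundant in one batch.

    Assumption: the batch runs for as many steps as its longest response,
    so a response of length n wastes (longest - n) slots.
    """
    if not response_lengths:
        return 0.0
    longest = max(response_lengths)
    total_slots = longest * len(response_lengths)
    useful_slots = sum(response_lengths)
    return (total_slots - useful_slots) / total_slots


# Hypothetical response lengths in tokens.
mixed_batch = [10, 20, 30, 400]        # very different lengths
similar_batch = [380, 390, 395, 400]   # similar lengths

print(wasted_token_fraction(mixed_batch))
print(wasted_token_fraction(similar_batch))
```

Under this assumption, batching responses of similar length leaves far fewer idle slots than mixing short and long responses, which is the intuition behind grouping queries by predicted length.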

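Proposed Method

Show that instruction-tuned LLMs can predict their response length through the Perception in Advance (PiA) approach.
Train a LoRA-based length predictor via instruction tuning and add it to the pipeline, decoupling length prediction from generation.
Develop a sequence scheduling system that groups queries by predicted length and handles mispredictions with failure collection and recomputation (FCR).
Introduce variable batch size (VBS), which adapts the batch size to the predicted length to stay within memory limits.
Apply a binning strategy and predict the maximum length (the maximum over four generations) to reduce the number of failures that must be re-collected.
Evaluate on real-world instruction datasets with Vicuna-7B on an 80GB A100, comparing throughput against vanilla batch inference.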
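The scheduling steps listed above (binning, grouping by predicted length, variable batch size) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Query` class, the bin size of 50 tokens, and the rule "batch size = token budget divided by binned length" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    predicted_len: int  # predicted response length in tokens (hypothetical field)


def bin_length(predicted_len, bin_size=50):
    """Round a predicted length up to the upper edge of its bin."""
    n = max(1, predicted_len)
    return -(-n // bin_size) * bin_size  # ceiling division


def variable_batch_size(binned_len, token_budget=8192, max_batch=64):
    """Assumed VBS rule: shorter responses allow larger batches
    under a fixed budget of generated tokens per batch."""
    return max(1, min(max_batch, token_budget // binned_len))


def schedule(queries, bin_size=50, token_budget=8192, max_batch=64):
    """Group queries by length bin, then split each group into micro-batches.

    Returns a list of (max_new_tokens, [Query, ...]) tuples, shortest bin first.
    """
    groups = {}
    for q in queries:
        groups.setdefault(bin_length(q.predicted_len, bin_size), []).append(q)

    batches = []
    for binned_len in sorted(groups):
        group = groups[binned_len]
        size = variable_batch_size(binned_len, token_budget, max_batch)
        for start in range(0, len(group), size):
            batches.append((binned_len, group[start:start + size]))
    return batches


if __name__ == "__main__":
    demo = [Query(f"q{i}", n) for i, n in enumerate([12, 45, 51, 480, 500, 30])]
    for max_new_tokens, batch in schedule(demo, token_budget=1000):
        print(max_new_tokens, [q.text for q in batch])
```

Using the upper edge of the bin as the generation limit gives each micro-batch some slack, so small prediction errors inside a bin do not count as failures.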

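Experimental Results

Research Questions

RQ1: Can LLMs reliably predict their response length before autoregressive decoding (PiA versus Perception Only, PO)?
RQ2: Does sequence scheduling based on response length perception improve inference throughput without harming quality?
RQ3: Which mitigation measures (FCR, VBS, binning) are effective for robust, scalable LLM inference?
RQ4: How does the proposed method work together with existing acceleration techniques (e.g., FlashAttention, quantization)?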

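Key Findings

Sequence scheduling driven by the PiA-based, instruction-tuned length predictor (evaluated in mean-length and max-length variants) raises throughput by 86% over vanilla inference.
Instruction-tuned length prediction clearly outperforms non-instruction-tuned approaches and simple pooling/MLP approaches at predicting response length.
Combining binning, failure collection and recomputation (FCR), and variable batch size (VBS) yields the largest throughput gains across datasets and settings.
The method is orthogonal to other inference acceleration methods, so it can complement existing toolkits (e.g., FlashAttention, quantization).
Experiments on Vicuna-7B show higher throughput on real-world instruction datasets while generation quality stays at an acceptable level or improves.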
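Failure collection and recomputation (FCR), named in the findings above, can be sketched as a two-pass loop. The sketch assumes that a "failure" is a response that has not finished when it reaches its batch's token limit, and that failures are set aside and regenerated together with a larger limit. The names `run_with_fcr`, `toy_generate`, `TRUE_LENGTHS`, and `fallback_max_tokens` are hypothetical; `toy_generate` is a stand-in for a real model call.

```python
def run_with_fcr(batches, generate_fn, fallback_max_tokens=512):
    """Run micro-batches, collect unfinished responses, then recompute them.

    batches: list of (max_new_tokens, [prompt, ...])
    generate_fn(prompts, max_new_tokens) -> list of (text, finished) tuples
    Returns (results, failed), where results maps prompt -> text.
    """
    results = {}
    failed = []
    # First pass: generate with the predicted limit and collect failures.
    for max_new_tokens, prompts in batches:
        outputs = generate_fn(prompts, max_new_tokens)
        for prompt, (text, finished) in zip(prompts, outputs):
            if finished:
                results[prompt] = text
            else:
                failed.append(prompt)
    # Second pass: recompute all failures with a larger limit.
    if failed:
        outputs = generate_fn(failed, fallback_max_tokens)
        for prompt, (text, _finished) in zip(failed, outputs):
            results[prompt] = text
    return results, failed


# Toy stand-in for a model: each prompt has a fixed "true" response length.
TRUE_LENGTHS = {"a": 20, "b": 40, "c": 300}


def toy_generate(prompts, max_new_tokens):
    outputs = []
    for p in prompts:
        produced = min(TRUE_LENGTHS[p], max_new_tokens)
        outputs.append(("x" * produced, TRUE_LENGTHS[p] <= max_new_tokens))
    return outputs


if __name__ == "__main__":
    results, failed = run_with_fcr([(50, ["a", "b", "c"])], toy_generate)
    print(failed, {p: len(t) for p, t in results.items()})
```

Deferring failures to a single later pass keeps one underestimated response from stretching the runtime of an otherwise short micro-batch.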


This review was generated by AI and checked by human editors.