QUICK REVIEW

[Paper Review] Speculative Decoding with Big Little Decoder

Sehoon Kim, Karttikeya Mangalam|arXiv (Cornell University)|Feb 15, 2023
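Topic Modeling | Cited by 8
One-Sentence Summary

BiLD pairs a small autoregressive decoder with a large decoder that runs non-autoregressively, and uses fallback and rollback policies to speed up text generation with as little quality loss as possible.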

ABSTRACT

The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced.

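Motivation and Objectives

  • Reduce the inference latency of autoregressive text generation while keeping quality loss to a minimum.
  • Introduce a plug-and-play framework that coordinates two decoders of different sizes without retraining the base models.
  • Propose simple policies (fallback and rollback) that decide when the large model steps in and when predictions are rolled back.
  • Demonstrate applicability on machine translation and summarization benchmarks, and provide an open-source implementation.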

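Proposed Method

  • Two-model decoding: the small model generates tokens autoregressively, and the large model refines those predictions non-autoregressively.
  • Fallback policy: when the small model's confidence (max p_S) drops below the threshold alpha_FB, large-model inference is triggered.
  • Rollback policy: when the large model's distribution deviates from an earlier small-model prediction by more than alpha_RB, the large model can overwrite that prediction and roll back the tokens that follow it.
  • Prediction alignment: calibration data is used to align the small model's outputs with the large model's, reducing unnecessary rollbacks.
  • Algorithm 1 describes the end-to-end BiLD procedure in detail, including the policy checks at each decoding step.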
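
The fallback and rollback policies listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` interface, `bild_decode`, and the toy models are hypothetical names, and the rollback distance (the negative log-probability that the large model assigns to the small model's token) is an assumed stand-in for whatever metric Algorithm 1 actually uses. A real large model would score all pending positions in a single forward pass; the sketch simulates that with one call per position.

```python
import math
from typing import Callable, List, Sequence

Dist = List[float]                       # next-token probabilities over a toy vocabulary
Model = Callable[[Sequence[int]], Dist]  # hypothetical interface: prefix -> distribution


def argmax(p: Dist) -> int:
    return max(range(len(p)), key=p.__getitem__)


def bild_decode(small: Model, large: Model, bos: int, eos: int,
                alpha_fb: float, alpha_rb: float, max_len: int = 32) -> List[int]:
    y = [bos]
    checked = 1  # y[:checked] was already verified or produced by the large model
    while y[-1] != eos and len(y) < max_len:
        p_s = small(y)
        if max(p_s) >= alpha_fb:
            y.append(argmax(p_s))  # small model is confident: accept its token
            continue
        # Fallback: confidence is below alpha_fb, so the large model takes over
        # and first re-examines the small model's unverified tokens.
        rolled_back = False
        for m in range(checked, len(y)):
            p_l = large(y[:m])
            # Assumed distance: how unlikely the small model's token is
            # under the large model's distribution at position m.
            distance = -math.log(max(p_l[y[m]], 1e-12))
            if distance > alpha_rb:
                y = y[:m] + [argmax(p_l)]  # rollback: overwrite token m, drop the rest
                rolled_back = True
                break
        if not rolled_back:
            y.append(argmax(large(y)))  # no rollback: large model supplies the next token
        checked = len(y)
    return y


# Toy models for illustration only (vocabulary {0: BOS, 1, 2, 3: EOS}).
TARGET = [0, 1, 2, 3]


def peaked(token: int, vocab: int = 4, peak: float = 0.97) -> Dist:
    rest = (1.0 - peak) / (vocab - 1)
    return [peak if t == token else rest for t in range(vocab)]


def toy_large(prefix: Sequence[int]) -> Dist:
    return peaked(TARGET[min(len(prefix), len(TARGET) - 1)])


def toy_small(prefix: Sequence[int]) -> Dist:
    if len(prefix) >= 3:
        return [0.25, 0.25, 0.25, 0.25]  # unsure: triggers fallback
    return peaked(2, peak=0.9)           # confident, but wrong at position 1


result = bild_decode(toy_small, toy_large, bos=0, eos=3, alpha_fb=0.6, alpha_rb=2.0)
print(result)  # [0, 1, 2, 3]
```

In this toy run the small model confidently emits a wrong token at position 1, then becomes unsure two steps later. The fallback hands control to the large model, which finds the earlier mismatch, overwrites it, and discards the tokens after it; decoding then resumes with the small model. Raising `alpha_fb` or lowering `alpha_rb` would invoke or trust the large model more often, which is the quality/latency trade-off the thresholds are meant to control.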

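Experimental Results

Research Questions

  • RQ1: Can a small autoregressive model supply most of the generated text, with only occasional large-model refinement, and thereby reduce latency?
  • RQ2: Do the simple fallback and rollback policies deliver significant speedups across tasks at an acceptable quality loss?
  • RQ3: Does aligning the independently trained small and large models improve BiLD's performance?
  • RQ4: Can BiLD be combined with early-exit strategies and remain competitive under a dedicated training pipeline?
  • RQ5: How does BiLD compare with fully autoregressive decoding on translation and summarization benchmarks?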

Key Findings

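| Task (Model)              | BLEU (IWSLT) | Speedup (IWSLT) | BLEU (WMT) | Speedup (WMT) | ROUGE-L (XSUM) | Speedup (XSUM) | ROUGE-L (CNN/DM) | Speedup (CNN/DM) |
|---------------------------|--------------|-----------------|------------|---------------|----------------|----------------|------------------|------------------|
| Vanilla Inference (large) | 40.32        | -               | 31.38      | -             | 35.08          | -              | 41.54            | -                |
| BiLD (Unaligned)          | 40.33        | 1.43x           | 31.28      | 1.34x         | 35.12          | 1.48x          | 41.44            | 1.71x            |
| BiLD (Unaligned) Degraded | 39.44        | 1.58x           | 30.47      | 1.43x         | 34.02          | 1.72x          | 40.57            | 2.05x            |
| BiLD (Aligned)            | 40.24        | 1.62x           | 31.26      | 1.47x         | 35.05          | 1.50x          | 41.52            | 1.85x            |
| BiLD (Aligned) Degraded   | 39.13        | 1.78x           | 30.33      | 1.70x         | 33.95          | 1.80x          | 40.96            | 2.12x            |
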
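  • BiLD achieves an end-to-end speedup of up to 2.12x on some tasks, with a drop of about 1 point in BLEU/ROUGE-L.
  • Unaligned BiLD averages a 1.50x speedup across the benchmarks, with no quality loss on some tasks; aligned BiLD averages 1.61x, with a maximum of 1.85x.
  • Prediction alignment further improves performance, giving higher speedup and better quality than unaligned BiLD.
  • Ablation studies show that both the rollback and fallback policies are essential for preserving the quality and latency gains.
  • BiLD extends to early-exit settings, reaching up to 1.74x speedup on the MT benchmarks with only a small BLEU loss.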
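
As a quick arithmetic check on the averages and the "about 1 point" figure quoted above, the per-task values in the results table can be recomputed directly. The numbers below are transcribed from that table; the variable names are illustrative only. The recomputed unaligned average comes out slightly below the quoted 1.50x, plausibly because the per-task speedups in the table are themselves rounded.

```python
from statistics import mean

# Per-task speedups in table order: IWSLT, WMT, XSUM, CNN/DM.
speedup = {
    "BiLD (Unaligned)": [1.43, 1.34, 1.48, 1.71],
    "BiLD (Aligned)": [1.62, 1.47, 1.50, 1.85],
}

# Quality scores (BLEU for IWSLT/WMT, ROUGE-L for XSUM/CNN/DM).
vanilla = [40.32, 31.38, 35.08, 41.54]
aligned_degraded = [39.13, 30.33, 33.95, 40.96]

avg_speedup = {name: round(mean(vals), 2) for name, vals in speedup.items()}
quality_drop = [round(v - d, 2) for v, d in zip(vanilla, aligned_degraded)]

print(avg_speedup)   # {'BiLD (Unaligned)': 1.49, 'BiLD (Aligned)': 1.61}
print(quality_drop)  # [1.19, 1.05, 1.13, 0.58]
```

The drops for the degraded aligned setting fall between roughly 0.6 and 1.2 points, which is consistent with the "about 1 point" description.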


This review was generated by AI and reviewed by human editors.