QUICK REVIEW

[Paper Review] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Huiqiang Jiang, Qianhui Wu|arXiv (Cornell University)|Oct 9, 2023
Topic Modeling | Cited by 8
One-Sentence Summary

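LLMLingua introduces a coarse-to-fine prompt compression framework that dynamically allocates compression budgets across the instruction, demonstrations, and question, applies iterative token-level compression, and aligns the small language model with the target large model, achieving up to 20x prompt compression with almost no loss in performance.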

ABSTRACT

Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.

Motivation and Objectives

  • Reduce prompt length to lower inference cost while preserving semantic integrity, with a focus on LLMs accessed through APIs.
  • Propose a coarse-to-fine prompt compression pipeline that preserves essential information under high compression ratios.
  • Mitigate distribution mismatch between small prompting LMs and target black-box LLMs via instruction tuning.
  • Demonstrate state-of-the-art performance on multiple datasets spanning reasoning, ICL, conversation, and summarization.

Proposed Method

  • Budget controller allocates compression budgets across instruction, demonstrations, and questions and performs coarse-grained demonstration-level compression.
  • Iterative token-level prompt compression (ITPC) uses a small LM to estimate conditional token probabilities and retain high-information tokens.
  • Sentence- and demonstration-level dropout preserves linguistic structure under high compression.
  • Distribution alignment via instruction tuning trains the small LM on data generated by the target LLM to reduce distribution gap.
  • Evaluation uses exact match, BLEU, ROUGE, and BERTScore across GSM8K, BBH, ShareGPT, and Arxiv-March23.
  • Headline claim: up to 20x compression with little performance loss, as stated in the abstract.
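
The coarse-to-fine idea in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: a toy add-one smoothed bigram model stands in for the small LM, demonstrations are ranked by mean surprisal (an assumption of this sketch, not stated in the summary), and the function names `train_bigram_surprisal`, `select_demonstrations`, and `compress_tokens`, along with the `keep_ratio` and `segment_size` parameters, are hypothetical.

```python
import math
from collections import Counter


def train_bigram_surprisal(corpus):
    """Toy stand-in for the small LM: add-one smoothed bigram model.

    Returns a function giving -log p(token | previous token) in nats.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    vocab_size = len(unigrams) + 1  # one extra slot for unseen tokens

    def surprisal(prev, token):
        prob = (bigrams[(prev, token)] + 1) / (unigrams[prev] + vocab_size)
        return -math.log(prob)

    return surprisal


def mean_surprisal(tokens, surprisal):
    """Average per-token surprisal of a token sequence."""
    prev, total = "<s>", 0.0
    for token in tokens:
        total += surprisal(prev, token)
        prev = token
    return total / max(1, len(tokens))


def select_demonstrations(demos, surprisal, token_budget):
    """Coarse step: keep whole demonstrations until the budget is spent.

    Ranking by mean surprisal, highest first, is an assumption of this sketch.
    """
    ranked = sorted(demos, key=lambda d: mean_surprisal(d, surprisal), reverse=True)
    chosen, used = [], 0
    for demo in ranked:
        if used + len(demo) <= token_budget:
            chosen.append(demo)
            used += len(demo)
    return chosen


def compress_tokens(tokens, surprisal, keep_ratio, segment_size=4):
    """Fine step: segment by segment, keep the most surprising tokens.

    Each segment is scored conditioned on the tokens already kept, so earlier
    deletions influence later scores.
    """
    kept = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        prev = kept[-1] if kept else "<s>"
        scores = []
        for token in segment:
            scores.append(surprisal(prev, token))
            prev = token
        n_keep = max(1, round(len(segment) * keep_ratio))
        top = sorted(range(len(segment)), key=lambda i: scores[i], reverse=True)[:n_keep]
        kept.extend(segment[i] for i in sorted(top))  # restore original order
    return kept


# Hypothetical usage with a tiny corpus; a real setup would use a neural LM.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
score = train_bigram_surprisal(corpus)
prompt = "the cat sat on the mat and the dog slept".split()
print(compress_tokens(prompt, score, keep_ratio=0.5))
```

Keeping the kept tokens in their original order is a deliberate choice here: the compressed prompt stays a subsequence of the input, which makes it easy to check what was dropped.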

Experimental Results

Research Questions

  • RQ1: How much can prompts be compressed while preserving LLM reasoning and in-context learning capabilities across diverse tasks?
  • RQ2: Can a budgeted, coarse-to-fine compression strategy maintain semantic integrity under high compression ratios?
  • RQ3: Does aligning the small prompting LM with target LLM distributions improve compression quality and downstream performance?
  • RQ4: What are the trade-offs between compression ratio, latency, and accuracy across reasoning, conversation, and summarization benchmarks?

Key Findings

  • The method achieves up to 20x compression with only a small performance drop on GSM8K and BBH under certain constraints.
  • LLMLingua consistently outperforms Selective-Context and random sentence/demonstration selection across multiple tasks.
  • LLMLingua preserves ICL capabilities, sometimes surpassing few-shot baselines at high compression ratios.
  • Distribution alignment via instruction tuning yields measurable gains on reasoning benchmarks.
  • End-to-end latency is reduced with modest computational overhead from the compression steps, enabling practical speedups.
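
The trade-off between compression ratio and latency can be made concrete with a small accounting sketch. The helper names and all numbers below are hypothetical and are not results from the paper; the sketch only assumes that the compression ratio is original tokens divided by compressed tokens, and that end-to-end time is the compression time plus the target LLM's time on the compressed prompt.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Ratio of original to compressed prompt length, e.g. 20.0 for 20x."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def end_to_end_speedup(llm_seconds_full: float,
                       llm_seconds_compressed: float,
                       compressor_seconds: float) -> float:
    """Speedup once the compression overhead is counted against the savings."""
    return llm_seconds_full / (llm_seconds_compressed + compressor_seconds)


# Hypothetical numbers: a 2000-token prompt compressed to 100 tokens.
print(compression_ratio(2000, 100))        # 20.0
# Hypothetical timings: 10 s uncompressed, 2 s compressed, 0.5 s to compress.
print(end_to_end_speedup(10.0, 2.0, 0.5))  # 4.0
```

The second number shows why the two quantities differ: a 20x shorter prompt does not give a 20x faster response, because the time spent on compression and on generating the output does not shrink with the prompt.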

This review was generated by AI and reviewed by a human editor.