QUICK REVIEW

[Paper Review] ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

Xin Men, Mingyu Xu | arXiv (Cornell University) | Mar 6, 2024

Topic Modeling | Cited by 15

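One-Sentence Summary

ShortGPT shows that large language models contain substantial layer-level redundancy. It prunes a model by removing the layers with the lowest Block Influence (BI) scores, retaining most of the performance with roughly 25% fewer parameters, and it is orthogonal to quantization, so the two can be combined.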

ABSTRACT

As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.

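Motivation and Objectives

Investigate whether large language models contain redundancy at the layer level, beyond the redundancy of individual parameters.
Propose a metric, Block Influence (BI), that quantifies the importance of each layer in an LLM.
Propose and evaluate a simple BI-guided pruning method based on layer removal.
Show that layer pruning is orthogonal to quantization and can complement other compression methods.
Assess the limitations and scope of layer pruning across different benchmarks and models.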

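Proposed Method

Define Block Influence (BI) to measure how much a layer transforms the hidden states during inference.
Compute BI from hidden states collected on a calibration set, and rank the layers by BI.
Perform layer removal by deleting the layers with the lowest BI scores, taken in ascending order of BI.
Evaluate the pruned models on standard benchmarks (MMLU, CMMLU, and others) across several open-source LLMs.
Compare ShortGPT with state-of-the-art pruning methods, and analyze depth redundancy versus width redundancy.
Demonstrate orthogonality to quantization by applying the pruning to a quantized Llama-2-7B-Base model.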
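
The steps above can be sketched in a few lines. In the ShortGPT paper, the BI of layer i is one minus the average cosine similarity between the layer's input and output hidden states, BI_i = 1 - E[cos(X_i, X_{i+1})], so a layer that barely changes its input scores close to 0. The code below is a minimal illustration on synthetic arrays, not the authors' implementation: the function names (`block_influence`, `bi_scores`, `layers_to_remove`, `prune`) and the toy perturbation scales are assumptions made for this example, and in practice the hidden states would be collected from a real model on a calibration set.

```python
import numpy as np

def block_influence(h_in, h_out, eps=1e-8):
    """BI = 1 - mean cosine similarity between a layer's input and output.

    h_in, h_out: arrays of shape (num_tokens, hidden_dim).
    """
    dot = np.sum(h_in * h_out, axis=-1)
    norms = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(1.0 - np.mean(dot / (norms + eps)))

def bi_scores(hidden_states):
    """hidden_states[i] is the input to layer i; hidden_states[i + 1] is its output."""
    return [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]

def layers_to_remove(scores, n_remove):
    """Indices of the n_remove layers with the lowest BI, in ascending index order."""
    order = np.argsort(scores)  # ascending BI
    return sorted(int(i) for i in order[:n_remove])

def prune(layers, removed):
    """Return the layer list with the selected indices dropped."""
    removed = set(removed)
    return [layer for i, layer in enumerate(layers) if i not in removed]

# Toy "calibration" data: each layer adds noise of a given scale to the
# hidden state (hypothetical values; layers 2 and 3 barely change it).
rng = np.random.default_rng(0)
h = rng.normal(size=(16, 8))
states = [h]
for scale in [1.0, 0.8, 0.01, 0.02, 0.9]:
    h = h + scale * rng.normal(size=h.shape)
    states.append(h)

scores = bi_scores(states)
removed = layers_to_remove(scores, n_remove=2)
kept = prune([f"layer{i}" for i in range(5)], removed)
print("BI scores:", [round(s, 4) for s in scores])
print("removed:", removed, "kept:", kept)
```

In this toy setup the two near-identity layers receive the lowest BI and are the ones dropped. The number of layers to remove is a free parameter here; how it is chosen for a real model is not specified in this review.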

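Experimental Results

Research Questions

RQ1: Can a per-layer BI metric reliably measure layer-level redundancy in LLMs?
RQ2: How much performance is retained when low-BI layers are removed, across different models and tasks?
RQ3: In current LLM architectures, does the redundancy lie mainly along depth (layers) or along width?
RQ4: Can BI-guided layer removal complement quantization to further reduce model size?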

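Key Findings

LLMs exhibit substantial layer-level redundancy, especially in the deeper layers.
BI (Block Influence) effectively captures layer importance and can guide pruning.
ShortGPT retains about 92% of performance while cutting about 25% of parameters and computation, outperforming previous pruning methods.
Layer removal (depth pruning) tends to outperform width compression methods such as pruning the embedding dimension.
The pruning method is orthogonal to quantization and can be combined with it for further compression.
Redundancy is widespread in Transformer-based models and is also present in non-Transformer architectures such as RWKV.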
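
To relate a "roughly 25% of parameters" reduction to a number of removed layers, a back-of-the-envelope count helps. The sketch below assumes a Llama-2-7B-like decoder (hidden size 4096, MLP size 11008, vocabulary 32000, 32 layers, standard multi-head attention without biases, untied input and output embeddings). These dimensions and the function names are assumptions for illustration and do not come from this review.

```python
def block_params(hidden, mlp):
    """Parameters in one decoder block under the assumptions above."""
    attn = 4 * hidden * hidden   # Q, K, V and output projections
    ffn = 3 * hidden * mlp       # gate, up and down projections
    norms = 2 * hidden           # two RMSNorm weight vectors
    return attn + ffn + norms

def total_params(hidden, mlp, vocab, n_layers):
    """All blocks plus embeddings, output head and final norm."""
    embeddings = 2 * vocab * hidden  # input embedding + untied output head
    return n_layers * block_params(hidden, mlp) + embeddings + hidden

def removed_fraction(n_removed, hidden=4096, mlp=11008, vocab=32000, n_layers=32):
    """Fraction of all parameters eliminated by dropping n_removed blocks."""
    return n_removed * block_params(hidden, mlp) / total_params(hidden, mlp, vocab, n_layers)

for k in (4, 8, 12):
    print(f"remove {k} of 32 layers -> {removed_fraction(k):.1%} of parameters")
```

Under these assumptions, dropping 8 of 32 blocks removes about 24% of the parameters, slightly less than 8/32 = 25% because the embeddings and output head are untouched by layer removal.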


This review was generated by AI and checked by a human editor.