QUICK REVIEW

[Paper Review] Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

Bo-Kyeong Kim, Geonmin Kim | arXiv (Cornell University) | Feb 5, 2024

Natural Language Processing Techniques | Cited by 8

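One-sentence summary

Depth pruning, which removes whole Transformer blocks and then applies LoRA retraining, is competitive with width pruning on zero-shot tasks and speeds up inference under memory-constrained, small-batch conditions.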

ABSTRACT

Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that simple depth pruning can effectively compress LLMs while achieving comparable or superior performance to recent width pruning studies. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. In retraining pruned models for quality recovery, continued pretraining on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios. We hope this work can help build compact yet capable LLMs. Code and models can be found at: https://github.com/Nota-NetsPresso/shortened-llm

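Motivation and objectives

Improve LLM inference efficiency in memory-constrained settings that force small batch sizes.
Propose a simple depth pruning method that removes entire Transformer blocks while leaving the shapes of the remaining weights unchanged.
Evaluate depth pruning on public LLMs (LLaMA-7B and Vicuna-7B/13B) against width pruning baselines such as Wanda-sp, FLAP, and LLM-Pruner.
Demonstrate that combining depth pruning with LoRA retraining speeds up generation while achieving competitive zero-shot task performance.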

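Proposed method

Treat the Transformer block as the prunable unit in order to reduce inference latency.
Score block importance with the Mag, Taylor, and PPL-based criteria, and adopt Taylor+ and PPL for the pruning decisions.
Perform one-shot pruning: remove the least important blocks until the target parameter count is met, always keeping the first four and the last two blocks.
Retrain the pruned model efficiently with LoRA (low-rank adaptation) on a calibration dataset so that performance recovers quickly.
Compare depth pruning with the width pruning baselines on zero-shot tasks, and measure latency, throughput, and memory usage under small-batch conditions.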
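The one-shot selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_blocks_to_prune`, its parameters, the assumption that every block has the same parameter count, and the importance scores in the usage lines are all hypothetical. In practice the scores would come from a criterion such as Taylor+ or PPL.

```python
from typing import List, Sequence


def select_blocks_to_prune(
    importance: Sequence[float],
    params_per_block: int,
    other_params: int,
    target_params: int,
    keep_first: int = 4,
    keep_last: int = 2,
) -> List[int]:
    """Return the indices of blocks to remove in a single pass.

    Assumes (for illustration) that all blocks have the same size and that
    `other_params` counts everything outside the blocks, e.g. embeddings.
    """
    n_blocks = len(importance)
    # The first `keep_first` and last `keep_last` blocks are never pruned.
    candidates = list(range(keep_first, n_blocks - keep_last))
    # Least important candidates are considered first.
    candidates.sort(key=lambda i: importance[i])

    removed: List[int] = []
    total = other_params + n_blocks * params_per_block
    for idx in candidates:
        if total <= target_params:
            break  # target parameter count reached
        removed.append(idx)
        total -= params_per_block
    return sorted(removed)


# Made-up scores for a 10-block toy model.
scores = [9.0, 8.0, 7.0, 6.0, 0.5, 3.0, 0.2, 4.0, 5.0, 9.5]
print(select_blocks_to_prune(scores, params_per_block=10,
                             other_params=5, target_params=85))  # [4, 6]
```

Because the scores are computed once and the blocks are removed in one pass, no retraining happens between removals; recovery is left to the LoRA step afterwards.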

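Experimental results

Research questions

RQ1: Can simple depth pruning of Transformer blocks match or exceed width pruning in the zero-shot performance of large language models?
RQ2: In memory-constrained, small-batch settings, do depth-pruned models deliver a real speedup for autoregressive generation?
RQ3: Which block-level importance criteria and pruning granularity give the best trade-off between accuracy and efficiency?
RQ4: Can one-shot depth pruning with LoRA retraining compete with iterative pruning approaches in practice?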

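Key findings

Under small-batch conditions, depth-pruned models generate faster than the original model.
When retrained with LoRA, depth pruning achieves zero-shot task performance competitive with the width pruning baselines (Wanda-sp, FLAP, LLM-Pruner).
The Taylor+ criterion improves commonsense reasoning accuracy, while the PPL criterion improves generation quality.
Depth-pruned models need less GPU memory, so under the same hardware constraints they can handle larger batch sizes or longer outputs than the unpruned model.
One-shot pruning with LoRA retraining comes close to the performance of iterative pruning, which makes deployment efficient.
For larger models, pruning whole Transformer blocks generally gives better results than pruning individual MHA/FFN modules. For smaller models, block pruning also remains advantageous.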
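One reason fewer blocks translate into lower memory use is that the key-value cache grows linearly with depth. The sketch below is a back-of-the-envelope estimate, assuming standard multi-head attention whose key and value width equals the hidden size, and fp16 storage. The function name `kv_cache_bytes` is hypothetical, and the 26-block configuration is an illustrative choice rather than a result from the paper. LLaMA-7B itself has 32 blocks and a hidden size of 4096.

```python
def kv_cache_bytes(n_layers: int, batch_size: int, seq_len: int,
                   hidden_size: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size; the factor 2 covers keys and values."""
    return 2 * n_layers * batch_size * seq_len * hidden_size * bytes_per_value


GIB = 2 ** 30
full = kv_cache_bytes(n_layers=32, batch_size=1, seq_len=2048, hidden_size=4096)
pruned = kv_cache_bytes(n_layers=26, batch_size=1, seq_len=2048, hidden_size=4096)
print(full / GIB)    # 1.0
print(pruned / GIB)  # 0.8125
```

The weights of the removed blocks are also freed, so the total saving is larger than this cache-only estimate suggests.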


This review was generated by AI and checked by a human editor.