QUICK REVIEW

[Paper Review] SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Saleh Ashkboos, Maximilian L. Croci|arXiv (Cornell University)|2024. 01. 26.

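Topic Modeling | Citations: 8

One-Line Summary

SliceGPT compresses large language models by applying orthogonal transformations and then deleting the rows and columns of the weight matrices that correspond to the least important principal components, shrinking models by up to 30% with minimal performance loss and faster inference.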

ABSTRACT

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression

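Motivation and Goals

Reduce the memory and compute costs of large language models (LLMs) after training.
Introduce a new sparsification paradigm that shrinks the embedding and layer matrices while preserving the network's function.
Enable faster inference on fewer GPUs without extensive fine-tuning.
Provide theoretical and empirical insight into computational invariance in transformer networks.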

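Proposed Method

Replace each weight matrix with a smaller dense matrix by projecting the activations onto their principal components with PCA.
Introduce computational invariance: an orthogonal transformation leaves the output unchanged as long as the matching inverse transformation follows it.
Convert the normalization layers to RMSNorm so that the invariant transformation can be applied.
Apply a per-layer orthogonal transformation Q_l and adjust the residual connections to keep the network equivalent.
Slice by discarding the minor principal components, which affect performance least, and deleting the corresponding rows/columns of W_in, W_out and W_embd.
Compute Q_l from a calibration dataset by running PCA on each layer's activations, which makes the slicing data-driven.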
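The invariance and slicing steps listed above can be illustrated on a single toy layer. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names unit_norm, pca_rotation and sliced_forward, the toy dimensions, and the random "calibration" data are all hypothetical, and RMSNorm is simplified to dividing each row by its norm with no learned scale.

```python
import numpy as np

def unit_norm(x):
    # Simplified RMSNorm (assumption): divide each row by its L2 norm, no learned scale.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pca_rotation(acts):
    # Orthogonal Q whose columns are eigenvectors of acts^T acts,
    # sorted from the largest eigenvalue to the smallest.
    eigvals, eigvecs = np.linalg.eigh(acts.T @ acts)  # eigh returns ascending order
    return eigvecs[:, np.argsort(eigvals)[::-1]]

def sliced_forward(X, Q, W_in, k):
    # Keep only the top-k principal directions of the signal and
    # the matching k rows of the rotated weight matrix.
    Qk = Q[:, :k]
    W_small = Qk.T @ W_in          # smaller dense matrix, shape (k, d_hidden)
    return unit_norm(X @ Qk) @ W_small

rng = np.random.default_rng(0)
n, d, d_hidden = 256, 16, 32

# Hypothetical calibration activations with a decaying spectrum, mixed by a
# random rotation so the principal axes are not the coordinate axes.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = (rng.normal(size=(n, d)) * np.linspace(3.0, 0.1, d)) @ R
W_in = rng.normal(size=(d, d_hidden))

Q = pca_rotation(X)
dense_out = unit_norm(X) @ W_in

# Computational invariance: rotating by Q before the norm and undoing the
# rotation afterwards gives the same result as not rotating at all.
invariant = np.allclose(unit_norm(X @ Q) @ Q.T, unit_norm(X))

def rel_error(out):
    return np.linalg.norm(out - dense_out) / np.linalg.norm(dense_out)

k = 12  # keep 12 of 16 directions, i.e. slice 25% of the width
sliced_out = sliced_forward(X, Q, W_in, k)
rel_err = rel_error(sliced_out)
full_err = rel_error(sliced_forward(X, Q, W_in, d))

print("invariance holds:", invariant)
print(f"relative error with {k}/{d} directions kept: {rel_err:.3f}")
print(f"relative error with all directions kept: {full_err:.1e}")
```

With k equal to the full width, the sliced path reproduces the dense output up to floating-point error. With a smaller k, the error depends on how quickly the activation spectrum decays, which is why the rotation is computed from calibration data rather than chosen at random.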

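Experimental Results

Research Questions

RQ1: Do orthogonal transformations preserve the transformer's outputs from layer to layer in a network connected by RMSNorm?
RQ2: Does per-layer, PCA-based slicing meaningfully reduce the parameter count and embedding size while preserving zero-shot and generation performance?
RQ3: What is the trade-off between performance and throughput when slicing up to about 30%?
RQ4: Is recovery fine-tuning after slicing necessary to maintain accuracy, or at least beneficial?

Key Results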

WikiText-2 perplexity (lower is better); percentages are the fraction sliced.
Method	OPT 125M	OPT 1.3B	OPT 2.7B	OPT 6.7B	OPT 13B	OPT 30B	OPT 66B	Llama-2 7B	Llama-2 13B	Llama-2 70B
Dense	27.64	14.61	12.46	10.85	10.12	9.56	9.33	5.47	4.88	3.32
SparseGPT 2:4	45.07	29.61	14.90	13.00	11.80	10.53	10.22	8.69	7.07	4.98
SliceGPT (10%)	29.34	15.10	12.75	10.92	10.27	9.65	9.43	5.89	5.21	3.69
SliceGPT (20%)	34.26	16.43	13.73	11.48	10.66	9.87	9.57	6.64	5.81	4.25
SliceGPT (25%)	37.74	17.46	14.56	11.90	10.94	10.04	9.68	7.24	6.30	4.60
SliceGPT (30%)	43.98	19.09	15.83	12.51	11.33	10.27	9.85	8.12	6.99	5.05

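SliceGPT can compress OPT and Llama-2 models by up to 30% (including embeddings) while retaining high task performance.
On WikiText-2 at 25% slicing, SliceGPT beats the SparseGPT 2:4 baseline on every model in the table, and its gap to dense perplexity narrows as models grow (9.33 to 9.68 for OPT 66B, versus 27.64 to 37.74 for OPT 125M).
On zero-shot tasks, SliceGPT reaches accuracy comparable to the dense model for the larger models, and OPT models generally tolerate compression better than Llama-2 models.
Combined with recovery fine-tuning (RFT) on Alpaca-style data, sliced large models recover a substantial share of their accuracy; for example, a sliced Llama-2 70B model with RFT comes close to dense performance on downstream tasks.
SliceGPT delivers meaningful throughput gains: on 80GB H100 GPUs, 25% slicing raises throughput by up to 1.55x, and 50% slicing lets the largest models run on one GPU instead of two.
The method achieves substantial parameter reduction and speedup in a single post-training procedure, without large-scale retraining.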
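The comparison against SparseGPT 2:4 can be checked directly from the table above. The short script below is only an illustrative calculation: the numbers are copied from the Dense, SparseGPT 2:4 and SliceGPT (25%) rows, and the helper name relative_increase is an assumption of this sketch.

```python
# WikiText-2 perplexities copied from the table above (lower is better).
models = ["OPT 125M", "OPT 1.3B", "OPT 2.7B", "OPT 6.7B", "OPT 13B",
          "OPT 30B", "OPT 66B", "Llama-2 7B", "Llama-2 13B", "Llama-2 70B"]
dense = [27.64, 14.61, 12.46, 10.85, 10.12, 9.56, 9.33, 5.47, 4.88, 3.32]
sparsegpt_2_4 = [45.07, 29.61, 14.90, 13.00, 11.80, 10.53, 10.22, 8.69, 7.07, 4.98]
slicegpt_25 = [37.74, 17.46, 14.56, 11.90, 10.94, 10.04, 9.68, 7.24, 6.30, 4.60]

def relative_increase(ppl, base):
    # Fractional perplexity increase over the dense model, per model.
    return [(p - b) / b for p, b in zip(ppl, base)]

slice_inc = relative_increase(slicegpt_25, dense)
sparse_inc = relative_increase(sparsegpt_2_4, dense)

for name, s, g in zip(models, slice_inc, sparse_inc):
    print(f"{name:12s} SliceGPT 25%: {s:+.1%}   SparseGPT 2:4: {g:+.1%}")

# True if SliceGPT at 25% has lower perplexity than SparseGPT 2:4 on every model.
slice_wins_everywhere = all(s < g for s, g in zip(slicegpt_25, sparsegpt_2_4))
print("SliceGPT 25% better on all models:", slice_wins_everywhere)
```

Reading the output per model family shows how much of the perplexity gap to the dense model closes as the parameter count grows.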

This review was generated by AI and reviewed by a human editor.