QUICK REVIEW

[Paper Review] Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

Rabeeh Karimi Mahabadi, James Henderson|arXiv (Cornell University)|2021. 06. 08.

Topic Modeling · References: 64 · Citations: 82

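One-Line Summary

Compacter introduces low-rank hypercomplex adapter layers for fine-tuning large pretrained language models, matching or exceeding the task performance of full fine-tuning while training only a tiny fraction of the parameters (as little as ~0.047%).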

ABSTRACT

Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pretrained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By only training 0.047% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning on SuperGLUE and low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.

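Motivation and Objectives

Motivate memory- and parameter-efficient fine-tuning of large pretrained language models (PLMs).
Develop adapters that reduce the number of trainable parameters while maintaining or improving task performance on NLP benchmarks.
Build compact adapter layers using Kronecker/low-rank factorization and hypercomplex multiplication.
Evaluate empirically on GLUE and SuperGLUE against strong baselines and analyze the efficiency trade-offs.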

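Proposed Method

Insert task-specific weights into the pretrained model through Compacter layers, computing them as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices.
Use a low-rank parameterization of the fast components to reduce the parameter count to O(k+d), compared with O(kd) for standard adapters.
Share the A_i matrices across all layers, keep the B_i matrices layer-specific, and further factorize each B_i as s_i t_i^T with rank r (typically r=1).
Replace the adapter's down-projection and up-projection with LPHM (low-rank parameterized hypercomplex multiplication) layers.
Keep the pretrained model frozen during training and update only the layer normalizations and the adapters, in the same way as standard adapters.
Optionally evaluate Compacter++, which removes the Compacter layer after self-attention in each block to reduce parameters further.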
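
The weight construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `lphm_weight` and `compacter_layer`, the toy sizes (k=8, d=4, n=2, r=1), the omission of bias terms, and the use of ReLU as a stand-in nonlinearity are all assumptions made for the example.

```python
import numpy as np

def lphm_weight(A, S, T):
    """Build W = sum_i kron(A[i], S[i] @ T[i]).

    A: (n, n, n)     shared "slow" weights, one n x n matrix per term
    S: (n, k//n, r)  layer-specific "fast" left factors s_i
    T: (n, r, d//n)  layer-specific "fast" right factors t_i^T
    Returns a (k, d) matrix.
    """
    return sum(np.kron(A[i], S[i] @ T[i]) for i in range(A.shape[0]))

def compacter_layer(x, A, down, up):
    """Adapter-style residual block whose projections are LPHM weights."""
    W_down = lphm_weight(A, *down)   # (k, d) down-projection
    W_up = lphm_weight(A, *up)       # (d, k) up-projection
    h = np.maximum(x @ W_down, 0.0)  # ReLU placeholder (assumption)
    return x + h @ W_up              # residual connection

# Toy sizes, chosen only for illustration.
k, d, n, r = 8, 4, 2, 1
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n, n))                      # would be shared across layers
down = (rng.normal(size=(n, k // n, r)), rng.normal(size=(n, r, d // n)))
up = (rng.normal(size=(n, d // n, r)), rng.normal(size=(n, r, k // n)))

x = rng.normal(size=(3, k))                         # batch of 3 hidden states
y = compacter_layer(x, A, down, up)
W_down = lphm_weight(A, *down)
fast_params_down = down[0].size + down[1].size      # equals r * (k + d)
```

In this sketch only `A`, `down`, and `up` would be trained; the down-projection's layer-specific factors hold r(k+d) = 12 values, versus kd = 32 for a dense projection of the same shape.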

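Experimental Results

Research Questions

RQ1: Can Compacter match full fine-tuning while training orders of magnitude fewer parameters?
RQ2: How do LPHM-based adapters compare with standard adapters and other parameter-efficient fine-tuning methods on GLUE/SuperGLUE?
RQ3: What are Compacter's trade-offs among memory, training time, and accuracy across different values of n and rank?
RQ4: Is sharing A_i across layers and using low-rank B_i sufficient to maintain performance across tasks and resource settings?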

Key Results

Method	#Total params	Trained params / per task	CoLA	SST-2	MRPC	QQP	STS-B	MNLI	QNLI	RTE	Avg
T5 BASE	8.0×1	100%	61.76	94.61	90.20/93.06	91.63/88.84	89.68/89.97	86.78	93.01	71.94	86.50
PHM-Adapter (n=12)	1.013	0.179%	57.35	94.50	91.67/93.86	90.25/87.05	90.45/90.84	85.97	92.92	75.54	86.40
Compacter (n=4)	1.004	0.073%	63.75	93.00	89.22/92.31	90.23/87.03	90.31/90.74	85.61	92.88	77.70	86.62
Compacter++ (n=4)	1.002	0.047%	61.27	93.81	90.69/93.33	90.17/86.93	90.46/90.93	85.71	93.08	74.82	86.47
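
As a sanity check on the table, the Avg column can be reproduced from the per-task scores. The sketch below assumes that the average is the plain mean over all 11 reported metric values, with tasks that report two metrics contributing both; that convention is inferred from the numbers, not stated in this review. The names `glue_scores` and `glue_average` are made up for the example.

```python
# Per-task scores copied from the table, in column order:
# CoLA, SST-2, MRPC (x2), QQP (x2), STS-B (x2), MNLI, QNLI, RTE.
glue_scores = {
    "T5 BASE": [61.76, 94.61, 90.20, 93.06, 91.63, 88.84,
                89.68, 89.97, 86.78, 93.01, 71.94],
    "PHM-Adapter (n=12)": [57.35, 94.50, 91.67, 93.86, 90.25, 87.05,
                           90.45, 90.84, 85.97, 92.92, 75.54],
    "Compacter (n=4)": [63.75, 93.00, 89.22, 92.31, 90.23, 87.03,
                        90.31, 90.74, 85.61, 92.88, 77.70],
    "Compacter++ (n=4)": [61.27, 93.81, 90.69, 93.33, 90.17, 86.93,
                          90.46, 90.93, 85.71, 93.08, 74.82],
}

def glue_average(values):
    """Mean over every reported metric value, rounded to two decimals."""
    return round(sum(values) / len(values), 2)

averages = {method: glue_average(v) for method, v in glue_scores.items()}
for method, avg in averages.items():
    print(f"{method}: {avg:.2f}")
```

Under that assumption the computed means agree with the Avg column for all four rows.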

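Compacter matches or exceeds full fine-tuning on GLUE and SuperGLUE while training only 0.073% (Compacter) or 0.047% (Compacter++) of the parameters.
Compacter outperforms several parameter-efficient baselines on the GLUE/SuperGLUE benchmarks, including Adapter, Adapter-LowRank, and prompt-tuning variants.
LPHM-based layers reduce the parameter count from the O(kd) of standard adapters to O(k+d) by sharing A_i across layers and using rank-one factors for B_i.
In low-resource settings, Compacter and Compacter++ can surpass standard fine-tuning in accuracy despite having far fewer trainable parameters.
Compacter++ can come close to full fine-tuning in many of the evaluated settings while using the least memory.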
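
To make the O(kd) versus O(k+d) comparison concrete, the sketch below counts the weights of a single k×d projection under three parameterizations. The sizes are illustrative assumptions: k=768 is the T5-base hidden size, d=24 is a hypothetical bottleneck width, and n=4 with r=1 follows the settings named in this review. Bias terms are ignored, and the function names are invented for the example.

```python
def adapter_params(k, d):
    """Dense k x d projection, as in a standard adapter."""
    return k * d

def phm_params(k, d, n):
    """PHM layer: n matrices A_i (n x n) plus n dense B_i (k/n x d/n)."""
    return n ** 3 + (k * d) // n

def lphm_params(k, d, n, r=1):
    """LPHM layer: B_i = s_i t_i^T with rank r.

    Returns (per_layer, shared): the layer-specific factor count r*(k+d)
    and the n**3 entries of A_i, which are shared across layers.
    """
    per_layer = n * r * (k // n + d // n)
    shared = n ** 3
    return per_layer, shared

k, d, n = 768, 24, 4            # illustrative sizes (assumptions)
dense = adapter_params(k, d)    # 18432
phm = phm_params(k, d, n)       # 4672
lphm_layer, lphm_shared = lphm_params(k, d, n)  # 792 per layer, 64 shared

print(dense, phm, lphm_layer, lphm_shared)
```

With these sizes the layer-specific cost falls from 18,432 weights to 792, and the 64 shared A_i entries are paid once for the whole model, not once per layer.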

This review was generated by AI and reviewed by a human editor.