QUICK REVIEW

[Paper Review] LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Yi-Lin Sung, Jaemin Cho | arXiv (Cornell University) | 2022-06-13

Domain Adaptation and Few-Shot Learning | Citations: 79

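One-Line Summary

Ladder Side-Tuning (LST) trains a lightweight side network that consumes intermediate backbone activations through ladder connections, enabling parameter- and memory-efficient transfer learning without backpropagating through the large backbone.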

ABSTRACT

Fine-tuning large pre-trained models on downstream tasks has been adopted in a variety of domains recently. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g. only using 2% of parameters) inside a pre-trained backbone network for a new task, they only reduce the training memory requirement by up to 30%. This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that can reduce training memory requirements by more substantial amounts. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations as input via shortcut connections (called ladders) from backbone networks and makes predictions. LST has significantly lower memory requirements than previous methods, because it does not require backpropagation through the backbone network, but instead only through the side network and ladder connections. We evaluate our method with various models (T5 and CLIP-T5) on both NLP (GLUE) and vision-and-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory costs to fine-tune the whole network, while other methods only save 26% of that in similar parameter usages (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models, attaining better GLUE performance than full fine-tuning and other PETL methods. The accuracy-efficiency trade-off also holds on VL tasks.
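
The 2.7x figure in the abstract can be checked from the two savings percentages it quotes (69% for LST, 26% for other PETL methods). This is only a worked ratio of those two numbers, not an additional result:

```latex
\frac{\text{memory saved by LST}}{\text{memory saved by other PETL methods}}
  = \frac{0.69}{0.26} \approx 2.65 \approx 2.7
```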

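Motivation and Objectives

Motivates the need for memory- and parameter-efficient transfer learning with large pre-trained models.
Proposes a side-network approach that avoids backpropagation through the backbone during training.
Improves efficiency through structured weight initialization of the side network and layer dropping.
Evaluates LST on NLP (GLUE) and vision-and-language tasks (VQA, GQA, NLVR2, MSCOCO) and compares it with PETL baselines.
Demonstrates scalability to larger backbones (T5-large, T5-3B) and the resulting memory savings.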

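Proposed Method

Trains a ladder side network g that takes intermediate activations from a frozen backbone f as input through ladder connections.
Uses a side network with reduced dimensionality (reduction factor r) and mixes the backbone and side representations at each layer with a learnable gate μi.
Initializes the side-network weights by structured pruning of the backbone (by Fisher information or weight magnitude), giving matrices with d_out/r rows and d_in columns.
Optionally drops layers of the side network (layer dropping) to reduce memory and parameters further.
During training, backpropagation runs only through the side network and the ladders, never through the backbone, which reduces memory use.
Provides encoder-only and encoder-decoder variants, with linear projections to downsample/upsample activations and a parallelizable forward pass.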
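
The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the gate form mu_i = sigmoid(alpha_i / T) with zero-initialized alpha_i and T = 0.1, the zero initial side state, and the tanh "layer" are all assumptions made for the example, and every function name here is hypothetical. Backbone activations enter only as constant inputs, which is why no gradient would need to flow through the backbone.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Small random matrix, used here only as placeholder weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def side_forward(backbone_acts, down_projs, side_layers, alphas, keep, temperature=0.1):
    """Run a toy side network over precomputed (frozen) backbone activations.

    backbone_acts[i]: activation of backbone layer i, treated as a constant
    down_projs[i]:    ladder projection from dimension d to d // r (trainable)
    side_layers[i]:   side-layer weights, (d // r) x (d // r) (trainable)
    alphas[i]:        scalar behind gate mu_i (trainable; gate form is an assumption)
    keep:             indices of side layers that are kept (layer dropping)
    """
    d_side = len(down_projs[0])
    h_side = [0.0] * d_side  # simplification: start from a zero side state
    for i in keep:
        mu = sigmoid(alphas[i] / temperature)
        ladder = matvec(down_projs[i], backbone_acts[i])  # downsample d -> d // r
        mixed = [mu * a + (1.0 - mu) * b for a, b in zip(ladder, h_side)]
        # tanh of a linear map stands in for a real Transformer layer
        h_side = [math.tanh(v) for v in matvec(side_layers[i], mixed)]
    return h_side

rng = random.Random(0)
d, r, n_layers = 8, 4, 6
acts = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_layers)]
downs = [rand_matrix(d // r, d, rng) for _ in range(n_layers)]
sides = [rand_matrix(d // r, d // r, rng) for _ in range(n_layers)]
alphas = [0.0] * n_layers  # zero init gives mu_i = 0.5
out = side_forward(acts, downs, sides, alphas, keep=[1, 3, 5])
print(len(out))  # prints 2
```

With `keep=[1, 3, 5]` only half of the ladders are used, so the activations of the dropped layers never influence the side output. In a real training setup the only trainable quantities would be the projections, the side layers, and the gate scalars.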

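Experimental Results

Research Questions

RQ1: Can Ladder Side-Tuning reduce training memory while still achieving competitive task performance compared with full fine-tuning and other PETL methods?
RQ2: How do structured initialization and layer dropping affect LST's performance and efficiency across NLP and VL tasks?
RQ3: Can LST be scaled to larger backbones (T5-large, T5-3B) while keeping its memory advantage?
RQ4: How do ladder connections and gating affect the use of intermediate backbone activations for task adaptation?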

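Key Findings

By avoiding backpropagation through the backbone, LST reduces training memory by up to 69% relative to full fine-tuning on GLUE, with accuracy comparable to or better than Adapter and LoRA in the low-memory regime.
Initializing the side-network weights by pruning with Fisher information or weight magnitude improves performance regardless of side-network size.
Layer dropping in the side network improves efficiency substantially with almost no loss in performance.
LST scales to larger models (T5-large, T5-3B) and, under a similar memory budget, achieves higher GLUE performance than full fine-tuning and other PETL methods.
On vision-and-language tasks, LST reaches competitive accuracy with markedly lower memory use (reported as 2.7x memory savings), and with roughly 7.5% trainable parameters it fits on a 16GB GPU.
Ablations confirm the benefit of the intermediate shortcut connections and of the initialization strategy; distillation-based and pruning-based initialization give comparable accuracy.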
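
The pruning-based initialization mentioned in the findings above can be illustrated with a small sketch. It assumes the weight-magnitude variant, scores each row of a backbone matrix by its squared L2 norm (the exact importance score is an assumption here), and keeps the top d_out/r rows so the result has d_out/r rows and d_in columns. The function name `prune_rows_by_magnitude` is hypothetical, and Fisher-information scoring is not shown.

```python
def prune_rows_by_magnitude(W, r):
    """Keep the len(W) // r rows with the largest squared L2 norm, in original order."""
    n_keep = len(W) // r
    scores = [sum(w * w for w in row) for row in W]  # one importance score per row
    top = sorted(range(len(W)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [list(W[i]) for i in sorted(top)]

# Toy backbone weight with d_out = 4 rows and d_in = 4 columns.
W_backbone = [
    [0.1, 0.0, 0.0, 0.1],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 3.0, 1.0],
]
W_side = prune_rows_by_magnitude(W_backbone, r=2)
print(W_side)  # prints [[2.0, 1.0, 0.0, 0.0], [0.0, 0.0, 3.0, 1.0]]
```

Keeping the surviving rows in their original order is a design choice for this example only; it makes the pruned matrix easy to compare against the backbone weight it came from.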

This review was generated by AI and reviewed by a human editor.