QUICK REVIEW

[Paper Review] ReLoRA: High-Rank Training Through Low-Rank Updates

Vladislav Lialin, Namrata Shivagunde|arXiv (Cornell University)|2023. 07. 11.

Advanced Neural Network Applications|Citations: 11

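One-Line Summary

ReLoRA trains high-rank transformer networks by accumulating many low-rank updates across restarts, reducing GPU RAM usage and speeding up training while achieving performance comparable to full-rank training.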

ABSTRACT

Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.

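Motivation and Goals

Motivates parameter-efficient pre-training of very large transformer models.
Investigates whether a high-rank update can be obtained through a sequence of low-rank updates.
Develops ReLoRA, which combines restarts, a jagged learning rate schedule, and partial optimizer resets.
Demonstrates ReLoRA on transformers with up to 1.3B parameters and compares it against LoRA and full-rank training.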
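
The second goal rests on a standard linear algebra fact: rank is subadditive, so a sum of low-rank matrices can have a higher rank than any single term. The statement below is a sketch in notation chosen for this review (an assumption, not taken from the source): $B_i A_i$ is the low-rank update learned between restart $i-1$ and restart $i$, $N$ is the number of restarts, and the layer weight has shape $m \times n$.

```latex
\operatorname{rank}(X + Y) \le \operatorname{rank}(X) + \operatorname{rank}(Y)

\Delta W = \sum_{i=1}^{N} B_i A_i,
\qquad B_i \in \mathbb{R}^{m \times r},\; A_i \in \mathbb{R}^{r \times n}

\operatorname{rank}(B_i A_i) \le r
\quad\Longrightarrow\quad
\operatorname{rank}(\Delta W) \le \min(m,\, n,\, N r)
```

This is only an upper bound. A single LoRA run is capped at rank $r$, while $N$ merged updates may reach up to $Nr$; whether training actually uses that headroom is an empirical question.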

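Proposed Method

Uses a warm start: an initial phase of full-rank training provides the starting point.
Applies LoRA-style low-rank updates of rank r=128 to the linear layers.
Uses multiple restarts to merge the low-rank updates into the base weights, so the total update is a sum of low-rank updates.
Uses a jagged cosine learning rate schedule: after each merge-and-reinit, the learning rate is set to zero and warmed back up.
Performs a partial optimizer-state reset via magnitude pruning, so that stale gradient moments from before the restart do not steer the new updates.
Keeps embeddings and normalization layers fully trained (full-rank) while the linear layers are updated through ReLoRA.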
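
The merge-and-reinit step and the partial optimizer reset listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`init_lora`, `merge_and_reinit`, `prune_by_magnitude`), the tiny layer sizes, and the `keep_fraction=0.1` value are all hypothetical, and random matrices stand in for trained factors and Adam moments.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_lora(d_out, d_in, r):
    """Fresh low-rank factors; B starts at zero so B @ A contributes nothing."""
    A = rng.normal(scale=0.02, size=(r, d_in))
    B = np.zeros((d_out, r))
    return A, B

def merge_and_reinit(W, A, B):
    """Fold the low-rank update into the frozen weight, then restart the factors."""
    W_merged = W + B @ A
    A_new, B_new = init_lora(W.shape[0], W.shape[1], A.shape[0])
    return W_merged, A_new, B_new

def prune_by_magnitude(state, keep_fraction):
    """Zero all but the largest-magnitude entries of an optimizer moment."""
    k = int(round(keep_fraction * state.size))
    if k == 0:
        return np.zeros_like(state)
    threshold = np.sort(np.abs(state).ravel())[-k]
    return np.where(np.abs(state) >= threshold, state, 0.0)

# Hypothetical toy sizes; the review reports r=128 on real linear layers.
d_out, d_in, r = 16, 12, 2
W = rng.normal(size=(d_out, d_in))
A, B = init_lora(d_out, d_in, r)
B = rng.normal(size=B.shape)            # stand-in for a trained B
x = rng.normal(size=d_in)

y_before = (W + B @ A) @ x
W, A, B = merge_and_reinit(W, A, B)
y_after = (W + B @ A) @ x               # B is zero again, so only W contributes

exp_avg = rng.normal(size=(d_out, r))   # stand-in for an Adam first moment
exp_avg_reset = prune_by_magnitude(exp_avg, keep_fraction=0.1)
```

The merge leaves the layer's output unchanged (`y_before` and `y_after` agree up to floating-point error), which is why restarts can happen mid-training without a jump in the loss from the weights themselves.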

Figure 1: Training loss for 250M models. ReLoRA learns a high-rank network through a sequence of low-rank updates. It outperforms networks with the same trainable parameter count and achieves similar performance to training a full network at 100M+ scale. The efficiency of ReLoRA increases with the m

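Experimental Results

Research Questions

RQ1: Can a high-rank network be trained effectively through a sequence of low-rank updates?
RQ2: How do the performance and efficiency of ReLoRA compare with LoRA and full-rank training across model sizes?
RQ3: Which training techniques (restarts, optimizer resets, warm start) are essential to ReLoRA's success?
RQ4: Do ReLoRA's efficiency and performance scale to larger transformer models (up to 1.3B parameters)?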

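Key Results

ReLoRA saves up to 5.5GB of RAM per GPU and improves training speed by 9-40%, depending on model size and hardware.
ReLoRA reaches perplexities close to full-rank training and outperforms LoRA; at the end of training the 1.3B model scores a perplexity of 17.27, versus 16.83 for full-rank training.
Singular value analysis shows that ReLoRA's updates have a spectrum closer to that of full-rank training, whereas LoRA's update spectrum is mostly zero (low-rank).
For the 1.3B model, ReLoRA with a warm start and restarts outperforms LoRA throughout training and narrows the gap to full-rank training (final perplexity 17.27 vs. 16.83).
The training speedup depends on hardware: on an 8x A100 setup ReLoRA gives roughly a 9% wall-clock speedup, with larger gains on cheaper hardware.
Online ReLoRA (very frequent resets) did not improve on standard ReLoRA in this study.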
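
The kind of singular value comparison mentioned in the results can be illustrated with a toy experiment. This sketch uses random Gaussian factors as a stand-in for trained updates (an assumption, so it reproduces none of the paper's measurements), and the sizes `d`, `r`, `restarts` and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, restarts = 64, 4, 8

def random_low_rank(d, r):
    """A d x d matrix of rank at most r, built as a product of thin factors."""
    return rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

def numerical_rank(M, tol=1e-8):
    """Count singular values above a tolerance relative to the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# One low-rank update, as in a single LoRA run.
single_update = random_low_rank(d, r)

# Several independent low-rank updates summed, as after repeated merges.
accumulated_update = sum(random_low_rank(d, r) for _ in range(restarts))

rank_single = numerical_rank(single_update)            # at most r
rank_accumulated = numerical_rank(accumulated_update)  # at most restarts * r
```

With independent random factors the accumulated update fills many more singular directions than the single update. Trained updates are not independent draws, so the real spectrum has to be measured on actual checkpoints.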

Figure 2: Jagged cosine scheduler used in ReLoRA. As a base for our scheduler we follow a standard cosine decay schedule as in Touvron et al. (2023). On every optimizer reset, we set the learning rate to zero and perform a quick (50-100 steps) learning rate warm-up back to the cosine schedule.
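
The schedule described in the Figure 2 caption can be written as a small pure-Python function. This is a sketch under assumptions: the function name, the linear shape of the ramp, the omission of any initial warm-up before the first reset, and the example values (`reset_every=2000`, `restart_warmup=100`, the learning rates) are illustrative choices, with only the 50-100 step warm-up range taken from the caption.

```python
import math

def jagged_cosine_lr(step, total_steps, max_lr, min_lr, reset_every, restart_warmup):
    """Cosine decay that drops to zero at each reset and ramps back linearly."""
    progress = min(step, total_steps) / total_steps
    base = min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    if step < reset_every:
        return base                      # no reset has happened yet
    since_reset = step % reset_every     # 0 exactly at a reset step
    ramp = min(1.0, since_reset / restart_warmup)
    return base * ramp

# Hypothetical configuration for illustration only.
cfg = dict(total_steps=10_000, max_lr=1e-3, min_lr=1e-4,
           reset_every=2_000, restart_warmup=100)

lr_start = jagged_cosine_lr(0, **cfg)         # top of the cosine curve
lr_at_reset = jagged_cosine_lr(2_000, **cfg)  # zero at the reset step
lr_mid_ramp = jagged_cosine_lr(2_050, **cfg)  # halfway back to the cosine value
lr_recovered = jagged_cosine_lr(2_100, **cfg) # back on the cosine curve
```

In a PyTorch training loop a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the optimizer's base learning rate instead of the absolute value.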

This review was generated by AI and checked by a human editor.