QUICK REVIEW

[Paper Review] On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang | arXiv (Cornell University) | 2020-02-12

Power Transformer Diagnostics and Insulation | Citations: 109

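One-Line Summary

This paper analyzes how the placement of layer normalization affects Transformer optimization and shows that Pre-LN enables training without warm-up and converges faster, whereas Post-LN depends on warm-up for stability.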

ABSTRACT

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

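Research Motivation and Goals

Explain why learning rate warm-up is essential for Post-LN Transformers, and how the placement of layer normalization affects gradient behavior.
Use mean field theory to analyze the gradient scale at initialization for the Post-LN and Pre-LN variants.
Empirically verify that warm-up can be removed for Pre-LN, and measure training speed and performance across NLP tasks.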
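
The two placements compared in this review can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: `toy_sublayer` is a hypothetical stand-in for self-attention or the FFN, and the layer normalization here has no learned gain or bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN: a ReLU scaled by 0.5."""
    return [0.5 * max(v, 0.0) for v in x]

def post_ln_block(x, sublayer=toy_sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer=toy_sublayer):
    # Pre-LN: normalize only the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [4.0, -2.0, 1.0, 7.0]
post_out = post_ln_block(x)  # re-normalized after every block
pre_out = pre_ln_block(x)    # input passes through on the identity path
```

In the Post-LN block every output is re-normalized, while in the Pre-LN block the input reaches the output unchanged along the residual path, which is the structural difference the gradient analysis builds on.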

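Proposed Method

Mean field theory for studying gradient scale at initialization in Post-LN and Pre-LN Transformers.
Theoretical analysis of the gradient norm of the last FFN layer and its dependence on the depth L.
Experiments on IWSLT14 De-En, WMT14 En-De, and BERT pre-training comparing settings with and without warm-up.
Controlled setting for the analysis: single-head attention, Xavier initialization, attention Q/K weights set to zero, Gaussian inputs.
Comparison of Post-LN and Pre-LN architectures with and without warm-up, using Adam and SGD/RAdam variants.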
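
One consequence of the controlled setting is easy to check directly: with the query and key matrices set to zero, every attention score is zero, so the softmax assigns uniform weights and attention reduces to averaging the values. The sketch below assumes standard scaled dot-product attention; the helper names and the sizes `n` and `d` are illustrative and do not come from the paper.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def attention_weights(xs, W_q, W_k):
    """Single-head scaled dot-product attention weights for a sequence xs."""
    d = len(xs[0])
    queries = [matvec(W_q, x) for x in xs]
    keys = [matvec(W_k, x) for x in xs]
    return [softmax([dot(q, k) / math.sqrt(d) for k in keys]) for q in queries]

random.seed(0)
n, d = 5, 4  # illustrative sequence length and hidden size
xs = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]  # Gaussian inputs
zero = [[0.0] * d for _ in range(d)]  # W_q = W_k = 0, as in the controlled setting

weights = attention_weights(xs, zero, zero)  # each row is uniform over the n positions
```

Because the weights no longer depend on the inputs, the attention sublayer becomes a fixed averaging operation at initialization, which is what makes the gradient analysis tractable.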

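Experimental Results

Research Questions

RQ1: Is the learning rate warm-up stage unnecessary for Pre-LN Transformers, given how their gradients behave at initialization?
RQ2: How does the placement of layer normalization affect gradient scale and training stability in the Transformer architecture?
RQ3: Across translation and pre-training tasks, do Pre-LN Transformers without warm-up match or outperform Post-LN baselines in convergence speed and final performance?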
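
RQ2 can be made concrete by writing out the two placements and the form of the gradient bounds from the paper's main theorem, with constants omitted. The notation is shorthand assumed for this review: F_l is the sublayer (attention or FFN) at layer l, d is the hidden dimension, L is the number of layers, and W^{2,L} is the second weight matrix of the FFN in the last layer.

```latex
\begin{align*}
\text{Post-LN:}\quad & x_{l+1} = \mathrm{LN}\bigl(x_l + F_l(x_l)\bigr) \\
\text{Pre-LN:}\quad  & x_{l+1} = x_l + F_l\bigl(\mathrm{LN}(x_l)\bigr) \\[6pt]
\text{Post-LN:}\quad & \Bigl\lVert \frac{\partial \tilde{\mathcal{L}}}{\partial W^{2,L}} \Bigr\rVert_F
  \le \mathcal{O}\bigl(d\sqrt{\ln d}\bigr) \\
\text{Pre-LN:}\quad  & \Bigl\lVert \frac{\partial \tilde{\mathcal{L}}}{\partial W^{2,L}} \Bigr\rVert_F
  \le \mathcal{O}\Bigl(d\sqrt{\tfrac{\ln d}{L}}\Bigr)
\end{align*}
```

The Post-LN bound does not depend on L, whereas the Pre-LN bound shrinks as the network gets deeper, which is consistent with the observation that Post-LN needs a small initial learning rate and Pre-LN does not.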

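Key Findings

Post-LN Transformers have large gradients near the output layer at initialization, so a large learning rate without warm-up makes training unstable.
Pre-LN Transformers have well-behaved gradients at initialization, so the warm-up stage can be removed.
Across IWSLT14 De-En, WMT14 En-De, and BERT pre-training, Pre-LN without warm-up matches or exceeds Post-LN in speed and final performance.
With the same lr_max setting, Pre-LN converges faster than Post-LN, which reduces hyperparameter sensitivity and training time.
Removing warm-up yields a substantial speedup through faster convergence and less hyperparameter tuning, while performance remains competitive.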
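
To show what removing warm-up means in practice, the sketch below contrasts a linear warm-up schedule with a schedule that starts at lr_max from the first step. The inverse-square-root decay after warm-up and the constant rate in the no-warm-up case are assumptions made for illustration, and the values of `lr_max` and `warmup_steps` are hypothetical; the paper's experiments use their own task-specific schedules.

```python
def lr_with_warmup(step, lr_max, warmup_steps):
    """Linear warm-up to lr_max, then inverse-square-root decay (assumed shape)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 so early updates stay small.
        return lr_max * step / warmup_steps
    return lr_max * (warmup_steps / step) ** 0.5

def lr_without_warmup(step, lr_max):
    """No warm-up: use lr_max from the first step (constant here for simplicity)."""
    return lr_max

lr_max = 1e-3        # hypothetical peak learning rate
warmup_steps = 4000  # hypothetical warm-up length

schedule = [(s, lr_with_warmup(s, lr_max, warmup_steps), lr_without_warmup(s, lr_max))
            for s in (0, 1000, 2000, 4000, 16000)]
for step, with_wu, without_wu in schedule:
    print(f"step={step:>6}  warm-up={with_wu:.6f}  no warm-up={without_wu:.6f}")
```

With warm-up, the schedule adds at least one extra hyperparameter (the warm-up length) and spends the early steps at a small learning rate; the no-warm-up variant removes both costs, which is the source of the speedup reported for Pre-LN.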

This review was generated by AI and reviewed by a human editor.