QUICK REVIEW

[Paper Review] R-Drop: Regularized Dropout for Neural Networks

Xiaobo Liang, Lijun Wu|arXiv (Cornell University)|2021. 06. 28.

Advanced Neural Network Applications|References: 74|Citations: 306

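One-Line Summary

R-Drop regularization enforces consistency between the outputs of two dropout-generated sub-models via a bidirectional KL divergence, which improves generalization, raises performance on NLP and CV tasks, and achieves SOTA on some translation benchmarks.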

ABSTRACT

Dropout is a powerful and widely used technique to regularize the training of deep neural networks. In this paper, we introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the freedom of the model parameters and complements dropout. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performances with the vanilla Transformer model on WMT14 English→German translation (30.91 BLEU) and WMT14 English→French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available at https://github.com/dropreg/R-Drop.

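Motivation and Objectives

Address the training-inference inconsistency that dropout introduces in deep networks.
Propose a simple regularization method that enforces consistency between the outputs of two dropout-induced sub-models.
Analyze theoretically how R-Drop reduces the training-inference inconsistency.
Demonstrate empirically that the method is broadly effective across NLP and CV tasks, including with large pre-trained models.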

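Proposed Method

For each training sample, run two forward passes with different dropout instantiations to obtain P1(y|x) and P2(y|x).
In addition to the standard negative log-likelihood (NLL) loss, minimize the bidirectional KL divergence between P1 and P2.
The final objective is the sum of the two NLL losses and an alpha-weighted KL term: L = L_NLL1 + L_NLL2 + (alpha/2)[KL(P1||P2) + KL(P2||P1)].
To compute both passes within the same mini-batch, each input in the batch is duplicated and fed through the model in a single training step.
Algorithmically, the model trains with two dropout sub-models per sample and updates its parameters by minimizing the combined loss.
The theoretical analysis shows that, for linear models, the constraint bounds the gap between the sub-model loss and the full-model loss.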
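
The objective and the batch-duplication step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`rdrop_loss`, `forward_with_dropout`, `rdrop_batch_loss`), the toy linear classifier, and the choice to apply dropout to input features are all hypothetical and chosen only to keep the example self-contained.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_div(log_p, log_q):
    """KL(P || Q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def rdrop_loss(logits1, logits2, target, alpha=1.0):
    """Per-sample loss: L_NLL1 + L_NLL2 + (alpha/2)[KL(P1||P2) + KL(P2||P1)]."""
    log_p1, log_p2 = log_softmax(logits1), log_softmax(logits2)
    nll = -log_p1[target] - log_p2[target]
    kl = kl_div(log_p1, log_p2) + kl_div(log_p2, log_p1)
    return nll + 0.5 * alpha * kl


def forward_with_dropout(weights, x, p, rng):
    """Toy linear classifier with inverted dropout on the inputs (assumes p < 1)."""
    kept = [xi / (1.0 - p) if rng.random() >= p else 0.0 for xi in x]
    return [sum(w * xi for w, xi in zip(row, kept)) for row in weights]


def rdrop_batch_loss(weights, inputs, targets, p=0.1, alpha=1.0, seed=0):
    """Duplicate every input so both dropout passes run in one step, then average."""
    rng = random.Random(seed)
    doubled = inputs + inputs  # [x1..xn, x1..xn]: each copy gets its own dropout mask
    logits = [forward_with_dropout(weights, x, p, rng) for x in doubled]
    n = len(inputs)
    losses = [rdrop_loss(logits[i], logits[i + n], targets[i], alpha) for i in range(n)]
    return sum(losses) / n
```

When the two passes produce identical logits (for example with `p=0.0`), the KL term vanishes and the loss reduces to twice the ordinary NLL, so `alpha` only matters to the extent that the two dropout masks make the sub-models disagree.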

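Experimental Results

Research Questions

RQ1: Does enforcing output consistency between two dropout-induced sub-models improve generalization across diverse tasks?
RQ2: How does R-Drop affect the training-inference inconsistency and the strength of regularization?
RQ3: Can SOTA be achieved with a vanilla Transformer and with large pre-trained models, without extra data or architectural changes?
RQ4: How does applying R-Drop during training affect stability and cost across domains?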

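Key Results

R-Drop yields substantial improvements across 5 tasks and 18 datasets, spanning NLP tasks, language modeling, and image classification.
With a vanilla Transformer, it reaches 30.91 BLEU on WMT14 En→De and 43.95 BLEU on WMT14 En→Fr translation, surpassing the previous SOTA.
On GLUE, RoBERTa-large with R-Drop (RD) reaches an average score of 89.73, outperforming strong baselines such as XLNet-large and ELECTRA-large.
On CNN/Daily Mail summarization, BART+RD achieves SOTA ROUGE-L, and ROUGE-1/2 improve by about 0.3 points over BART.
On Wikitext-103 language modeling, RD improves the perplexity of the Transformer and Adaptive Input Transformer baselines (e.g., Transformer: validation 25.76 to 23.97; test 26.62 to 24.94).
In image classification, RD improves the accuracy of ViT models (e.g., ViT-B/16: CIFAR-100 92.64→93.29; ImageNet 83.97→84.38).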

This review was generated by AI and reviewed by a human editor.