QUICK REVIEW

[Paper Review] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Jesse Dodge, Gabriel Ilharco | arXiv (Cornell University) | 2020. 02. 14.

Topic Modeling | Citations: 263

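One-Line Summary

The paper shows that the random seeds controlling weight initialization and data order cause substantial variance in BERT fine-tuning on GLUE tasks, demonstrates gains from running many trials and stopping unpromising ones early, and releases extensive training data for further analysis.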

ABSTRACT

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.

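Motivation and Objectives

Assess how varying the random seeds for weight initialization and data order affects fine-tuning performance on GLUE tasks.
Quantify how the performance of the best-found model improves as more fine-tuning trials are run.
Identify practical early-stopping strategies that reduce wasted computation during fine-tuning.
Assess whether some seed configurations provide consistently strong initializations or data orders across tasks.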

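Proposed Method

Fine-tune BERT-large on four GLUE tasks, varying only two random seeds: one for the weight initialization of the final classification layer and one for the training data order.
Train hundreds of models per task (625 for each of the smaller datasets; 225 for SST) to capture seed-induced variance.
Measure validation performance for each run and compute the expected best performance as a function of the number of trials.
Analyze the effects of the WI (weight initialization) and DO (data order) seeds both separately and jointly to isolate the sources of variance.
Propose and evaluate an algorithm that stops less promising trials early under a fixed compute budget.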
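
The "expected best performance as a function of the number of trials" step can be illustrated with a small sketch. It assumes the n trials are drawn independently, with replacement, from the empirical distribution of the observed validation scores; whether this matches the paper's exact computation is an assumption. The function name `expected_max_performance` and the toy scores are hypothetical and are not taken from the paper's released data.

```python
from typing import List, Sequence


def expected_max_performance(scores: Sequence[float], max_trials: int) -> List[float]:
    """Estimate E[best validation score among n trials] for n = 1..max_trials.

    Assumption: the n trials are i.i.d. draws (with replacement) from the
    empirical distribution of the observed scores.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    total = len(ordered)
    curve = []
    for n in range(1, max_trials + 1):
        expected = 0.0
        for rank, value in enumerate(ordered, start=1):
            # Probability that the maximum of n draws is the rank-th smallest score.
            prob = (rank / total) ** n - ((rank - 1) / total) ** n
            expected += value * prob
        curve.append(expected)
    return curve


# Hypothetical validation accuracies from five fine-tuning runs (not from the paper).
toy_scores = [0.62, 0.70, 0.55, 0.71, 0.68]
curve = expected_max_performance(toy_scores, max_trials=3)
print([round(v, 4) for v in curve])
```

With one trial the estimate is simply the mean of the observed scores, and it rises toward the best observed score as the number of trials grows, which is the shape of curve the review describes.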

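Experimental Results

Research Questions

RQ1: How much do the WI and DO seeds contribute to the variance in fine-tuning performance across GLUE tasks?
RQ2: Does the best-performing model improve noticeably with more fine-tuning trials, and how quickly does that improvement converge?
RQ3: Can early stopping reduce computation while preserving or improving expected performance?
RQ4: Are some seed configurations consistently robust across tasks?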
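
As a rough illustration of the first research question, the sketch below compares how much validation scores move when only the weight-initialization seed changes versus when only the data-order seed changes. It assumes a full grid of runs indexed by (WI seed, DO seed). The function `seed_mean_variances` and the toy numbers are hypothetical, and this simple comparison of marginal means is not claimed to be the analysis the authors performed.

```python
from statistics import mean, pvariance
from typing import Dict, Tuple


def seed_mean_variances(scores: Dict[Tuple[int, int], float]) -> Tuple[float, float]:
    """Return (variance of per-WI-seed means, variance of per-DO-seed means).

    `scores` maps (wi_seed, do_seed) -> validation score and must cover
    every combination of the seeds it mentions (a full grid).
    """
    wi_seeds = sorted({wi for wi, _ in scores})
    do_seeds = sorted({do for _, do in scores})
    # Average over data orders to summarize each weight initialization.
    wi_means = [mean(scores[(wi, do)] for do in do_seeds) for wi in wi_seeds]
    # Average over weight initializations to summarize each data order.
    do_means = [mean(scores[(wi, do)] for wi in wi_seeds) for do in do_seeds]
    return pvariance(wi_means), pvariance(do_means)


# Hypothetical 2x2 grid of validation accuracies (not from the paper).
toy_grid = {(0, 0): 0.6, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.9}
wi_var, do_var = seed_mean_variances(toy_grid)
print(wi_var, do_var)
```

If the two variances come out at a similar magnitude on real data, that would be in line with the reported finding that both seeds contribute comparably.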

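Key Findings

Running multiple fine-tuning trials with different random seeds yields substantial performance gains over single-trial results on all four GLUE tasks.
Weight initialization and training data order contribute comparably to the variance, and some seeds are consistently better across tasks.
Some seed configurations are robust across multiple datasets, suggesting that good initializations can transfer.
A simple early-stopping strategy can improve expected performance under a fixed budget by stopping the least promising trials early.
The authors release data from 2,100 fine-tuning trials for analysis of training dynamics.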
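
The early-stopping idea in the findings above can be sketched as follows: start many trials, evaluate them at an early checkpoint, and continue only the most promising ones. This is a minimal sketch under stated assumptions, not the authors' implementation. The function `start_many_continue_some`, its parameters `stop_at` and `keep`, and the toy validation curves are all hypothetical.

```python
from typing import Sequence, Tuple


def start_many_continue_some(
    curves: Sequence[Sequence[float]], stop_at: int, keep: int
) -> Tuple[float, int]:
    """Keep the `keep` best trials at checkpoint `stop_at` and finish only those.

    `curves[i][t]` is the validation score of trial i at checkpoint t.
    Returns (best final score among continued trials, checkpoints trained).
    """
    if not 0 < keep <= len(curves):
        raise ValueError("keep must be between 1 and the number of trials")
    length = len(curves[0])
    if not 0 <= stop_at < length:
        raise ValueError("stop_at must index an existing checkpoint")
    # Rank trials by their score at the early checkpoint, best first.
    ranked = sorted(range(len(curves)), key=lambda i: curves[i][stop_at], reverse=True)
    survivors = ranked[:keep]
    best_final = max(curves[i][-1] for i in survivors)
    # Every trial pays for checkpoints 0..stop_at; survivors also pay for the rest.
    cost = len(curves) * (stop_at + 1) + keep * (length - stop_at - 1)
    return best_final, cost


# Hypothetical validation curves for four trials (not from the paper).
toy_curves = [
    [0.50, 0.52, 0.51, 0.50],  # a run that never improves
    [0.55, 0.63, 0.68, 0.70],
    [0.54, 0.60, 0.66, 0.69],
    [0.50, 0.50, 0.50, 0.50],  # another unpromising run
]
best, cost = start_many_continue_some(toy_curves, stop_at=1, keep=2)
print(best, cost)
```

In this toy example the two flat runs are dropped after the second checkpoint, so the best final score is still found while training fewer checkpoints than running all four trials to completion.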


This review was generated by AI and reviewed by a human editor.