QUICK REVIEW

[Paper Review] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Jesse Dodge, Gabriel Ilharco | arXiv (Cornell University) | February 15, 2020

Topic Modeling | References: 29 | Citations: 216

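One-line summary

This paper shows that the random seeds controlling weight initialization and training data order cause substantial variance in BERT fine-tuning, that running multiple trials yields gains, proposes an early-stopping approach, and releases 2,100 fine-tuning runs on GLUE tasks.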

ABSTRACT

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.

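Research motivation and goals

Understand how random seeds affect the fine-tuning performance of pretrained language models.
Quantify how much weight initialization and training data order each contribute to performance variance.
Assess whether running multiple fine-tuning trials yields a practical gain over a single trial.
Propose an early-stopping strategy that reduces compute cost while preserving performance.
Publicly release the data from the fine-tuning runs to facilitate further analysis of training dynamics.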

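Proposed method

Fine-tune BERT-large on four GLUE tasks, varying only the random seeds that control the weight initialization (WI) of the final layer and the training data order (DO).
Train each task for 3 epochs with standard hyperparameters and report validation performance for every seed combination.
Separate WI and DO on a grid of seeds to analyze the variance, and compute the expected best performance as a function of the number of trials.
Use ANOVA to test whether mean performance differs between the best and worst WI/DO seeds.
Propose and evaluate a simple early-stopping algorithm that cuts compute cost by halting less promising trials partway through training.
Publicly provide the full dataset of 2,100 fine-tuning runs, including training loss and validation performance.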
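
The "expected best performance as a function of the number of trials" can be made concrete with a short sketch. The review does not state the exact estimator, so the version below is an assumption: it treats the observed validation scores as an empirical distribution and computes the expected maximum of n draws with replacement. The function name `expected_best` and the sample scores are hypothetical and are not taken from the paper.

```python
def expected_best(scores, n):
    """Expected maximum of n draws (with replacement) from the
    empirical distribution of the observed validation scores.

    Assumption: this estimator is one common choice; the review does
    not say which formula the authors used.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    total = len(ordered)
    # P(max of n draws equals the i-th smallest score)
    #   = (i / total)^n - ((i - 1) / total)^n
    return sum(
        value * ((i / total) ** n - ((i - 1) / total) ** n)
        for i, value in enumerate(ordered, start=1)
    )


# Illustrative validation scores (made up for this example).
observed = [0.5, 0.7, 0.9]
curve = [expected_best(observed, n) for n in range(1, 6)]
```

With n = 1 the estimate reduces to the mean of the observed scores, and it can only grow toward the best observed score as n increases, which is the shape one would expect when plotting best-found performance against the number of fine-tuning trials.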

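Experimental results

Research questions

RQ1: How much variance in fine-tuning performance do the random seeds controlling WI and DO cause?
RQ2: Do some WI and DO seeds consistently outperform others across tasks, and which seeds generalize across datasets?
RQ3: What benefit do multiple fine-tuning trials provide in terms of best validation performance on GLUE tasks?
RQ4: Can an early-stopping strategy reduce computation while minimizing the loss in final performance?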
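
As a rough illustration of how the first two questions could be examined on a grid of seeds, the sketch below averages validation scores per WI seed and per DO seed and compares how widely those averages spread. This is a descriptive summary only, not the ANOVA test the review mentions, and the grid values are invented for the example; `seed_effects` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def seed_effects(grid):
    """Summarize a grid where grid[w][d] is the validation score for
    WI seed index w and DO seed index d."""
    wi_means = [mean(row) for row in grid]        # average over DO seeds
    do_means = [mean(col) for col in zip(*grid)]  # average over WI seeds
    return {
        "wi_means": wi_means,
        "do_means": do_means,
        "wi_spread": pstdev(wi_means),  # spread of per-WI-seed averages
        "do_spread": pstdev(do_means),  # spread of per-DO-seed averages
    }


# Made-up scores for 3 WI seeds x 3 DO seeds (not from the paper).
grid = [
    [0.90, 0.88, 0.89],
    [0.85, 0.86, 0.84],
    [0.91, 0.90, 0.92],
]
effects = seed_effects(grid)
best_wi = max(range(len(grid)), key=lambda w: effects["wi_means"][w])
```

Repeating the same summary on each task and checking whether the same WI seed index keeps ranking near the top is one simple way to probe whether a seed generalizes across datasets.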

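Key findings

Running multiple fine-tuning trials with different seeds yields substantial gains over a single trial on all four GLUE tasks.
Weight initialization and data order contribute comparably to performance variance.
Some weight initializations perform well across multiple tasks, suggesting that certain WI seeds are favorable regardless of task.
Early stopping can reduce compute cost while achieving equal or better expected performance across budgets.
The best performance found over many trials substantially exceeds previously published results for the same model and setup on several tasks.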
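
The early-stopping finding can be pictured with a minimal sketch. The review only says that less promising runs are stopped partway through training, so the details here are assumptions: every seed is trained briefly, and only the top `keep` runs by early validation score are trained to completion. The function `start_many_continue_some`, its parameters, and the toy score tables are hypothetical stand-ins for real training code.

```python
def start_many_continue_some(seeds, train_partial, train_to_completion, keep):
    """Train every seed briefly, then finish only the `keep` most
    promising ones and return the best finished run.

    train_partial(seed) and train_to_completion(seed) are caller-supplied
    callables that return a validation score (hypothetical interfaces).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    partial = {seed: train_partial(seed) for seed in seeds}
    # Rank runs by their early validation score, best first.
    survivors = sorted(partial, key=partial.get, reverse=True)[:keep]
    final = {seed: train_to_completion(seed) for seed in survivors}
    best_seed = max(final, key=final.get)
    return best_seed, final[best_seed], survivors


# Toy lookup tables standing in for real training (made-up numbers).
early_scores = {0: 0.55, 1: 0.80, 2: 0.78, 3: 0.50}
final_scores = {0: 0.60, 1: 0.88, 2: 0.91, 3: 0.52}

best_seed, best_score, survivors = start_many_continue_some(
    seeds=list(early_scores),
    train_partial=early_scores.get,
    train_to_completion=final_scores.get,
    keep=2,
)
```

In this toy run only two of the four seeds are trained to completion, which is where the compute saving comes from; the trade-off is that a run that looks weak early but would have finished strong is discarded.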


This review was generated by AI and reviewed by a human editor.