QUICK REVIEW

[Paper Review] An Empirical Investigation of the Role of Pre-training in Lifelong Learning

Sanket Vaibhav Mehta, Darshan Patil|arXiv (Cornell University)|2021. 12. 16.

Domain Adaptation and Few-Shot Learning | Citations: 42

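One-line summary

This paper shows that generic pre-trained initialization implicitly reduces catastrophic forgetting when tasks are learned sequentially, explains why by analyzing the flatness of the loss landscape (pre-trained weights lead to wider minima), and proposes a sharpness-aware optimization method to mitigate forgetting further.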

ABSTRACT

The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks.

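Motivation and Objectives

Motivates lifelong learning as an energy-efficient alternative to isolated training and addresses catastrophic forgetting.
Systematically evaluates the effect of pre-training on forgetting across NLP and CV benchmarks with varying degrees of task diversity.
Analyzes the loss landscape to understand why pre-training alleviates forgetting.
Proposes an optimization objective that explicitly encourages flat (wide) loss basins and evaluates empirically whether it reduces forgetting.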

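Proposed Method

Compares pre-trained and randomly initialized models on standard task-incremental lifelong learning benchmarks in CV and NLP.
Uses the DistilBERT and ResNet-18 architectures, each with both pre-trained and random initialization.
Analyzes the loss landscape and sharpness to assess the structure of the minima reached after sequential fine-tuning.
Computes a sharpness metric and performs linear interpolation between the minima of sequential tasks to assess basin width.
Applies Sharpness-Aware Minimization (SAM) to jointly optimize the current task loss and basin sharpness, and compares against the fine-tuning (FT), Elastic Weight Consolidation (EWC), and Experience Replay (ER) baselines.
To control for pre-training overlap, removes the overlapping ImageNet classes when pre-training ResNet-18-PT.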
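
To make the SAM step above concrete, here is a minimal NumPy sketch of a single sharpness-aware update. It is an illustration only, not the authors' implementation: the toy quadratic loss, the matrix `A`, and the values of `lr` and `rho` are all assumptions chosen for the example. The update first moves to the approximate worst-case point inside an L2 ball of radius `rho`, then descends using the gradient evaluated there.

```python
import numpy as np

def loss(w, A):
    """Toy quadratic loss 0.5 * w^T A w (a stand-in for a task loss)."""
    return 0.5 * w @ A @ w

def grad(w, A):
    """Gradient of the toy quadratic loss (A is assumed symmetric)."""
    return A @ w

def sam_step(w, A, lr=0.1, rho=0.05):
    """One SAM-style update on the toy loss.

    1. Compute the gradient at the current weights.
    2. Step to the first-order worst-case point within an L2 ball of radius rho.
    3. Apply gradient descent using the gradient taken at that perturbed point.
    """
    g = grad(w, A)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # small constant avoids division by zero
    g_sam = grad(w + eps, A)
    return w - lr * g_sam

# Hypothetical setup: one sharp direction (curvature 10) and one flat direction (0.1).
A = np.diag([10.0, 0.1])
w = np.array([1.0, 1.0])
initial_loss = loss(w, A)

for _ in range(100):
    w = sam_step(w, A)

final_loss = loss(w, A)
print(initial_loss, final_loss)
```

In a real model the two gradient evaluations correspond to two forward-backward passes per batch, so a SAM step costs roughly twice as much as a plain SGD step.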

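Experimental Results

Research Questions

RQ1: Does pre-training implicitly alleviate forgetting in lifelong learning across diverse tasks and domains?
RQ2: Do pre-trained models forget to a similar degree on homogeneous task sequences and on diverse task sequences?
RQ3: How do different pre-trained initializations (model size, corpus diversity) affect forgetting?
RQ4: Beyond the effect of pre-training, can an optimization that explicitly flattens the basin further reduce forgetting?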

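Key Findings

Pre-trained initialization leads to markedly less forgetting than random initialization across multiple benchmarks and baselines.
The reduction in forgetting holds in both natural language processing (NLP) and computer vision (CV), although highly diverse task sequences remain challenging.
Greater model capacity and pre-training corpus diversity (e.g., RoBERTa-base, larger models) reduce forgetting more effectively.
Pre-trained weights tend to place sequential fine-tuning in wider (flatter) basins, as confirmed by the loss landscape analysis and the sharpness metric.
SAM, which explicitly optimizes for basin flatness alongside the task loss, reduces forgetting and can outperform several state-of-the-art continual learning methods in multiple settings.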
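
The linear-interpolation analysis behind the basin-width finding can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's evaluation code: the quadratic task-1 loss, the two weight vectors, and the helper names `interpolate_loss` and `make_task1_loss` are hypothetical. The idea is to evaluate the task-1 loss along the straight line from the task-1 minimum to the weights reached after task 2. A slow rise along that path suggests a wide basin and, in turn, less forgetting.

```python
import numpy as np

def interpolate_loss(loss_fn, w_a, w_b, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the line from w_a to w_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1 - a) * w_a + a * w_b) for a in alphas])
    return alphas, losses

# Hypothetical weights: the task-1 minimum and the weights after training on task 2.
w1_star = np.array([0.0, 0.0])
w2_star = np.array([1.0, 0.0])

def make_task1_loss(curvature):
    """Toy task-1 loss: a quadratic bowl centred at w1_star with the given curvature."""
    return lambda w: 0.5 * curvature * np.sum((w - w1_star) ** 2)

# Low curvature mimics a wide basin; high curvature mimics a sharp one.
_, flat_path = interpolate_loss(make_task1_loss(0.1), w1_star, w2_star)
_, sharp_path = interpolate_loss(make_task1_loss(10.0), w1_star, w2_star)

print("task-1 loss at task-2 weights (wide basin): ", flat_path[-1])
print("task-1 loss at task-2 weights (sharp basin):", sharp_path[-1])
```

With the same displacement between the two minima, the sharp basin incurs a much larger increase in task-1 loss, which is the intuition for why flatter minima are associated with less forgetting.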


This review was generated by AI and reviewed by a human editor.