QUICK REVIEW

[Paper Review] An Empirical Investigation of the Role of Pre-training in Lifelong Learning

Sanket Vaibhav Mehta, Darshan Patil|arXiv (Cornell University)|Dec 16, 2021

Domain Adaptation and Few-Shot Learning | Cited by 42

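One-Sentence Summary

This paper shows that generic pre-trained initialization implicitly reduces catastrophic forgetting in sequential task learning, analyzes the cause of this effect through the flatness of the loss landscape, and proposes a sharpness-aware optimization approach to further mitigate forgetting.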

ABSTRACT

The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks.

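Motivation and Objectives

Promote energy-efficient lifelong learning as an alternative to isolated training, and address catastrophic forgetting.
Systematically evaluate the effect of pre-training on forgetting across NLP and CV benchmarks with varying degrees of task diversity.
Analyze the loss landscape to understand why pre-training mitigates forgetting.
Propose and validate an optimization objective that targets flat loss basins in order to reduce forgetting explicitly.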

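Proposed Method

Compare pre-trained and randomly initialized models on standard task-incremental lifelong learning benchmarks in CV and NLP.
Use the DistilBERT and ResNet-18 architectures, each with both pre-trained and random initialization.
Analyze the loss landscape and sharpness to characterize the minima reached after sequential fine-tuning.
Compute sharpness metrics and linearly interpolate between the minima of successive tasks to assess basin width.
Apply Sharpness-Aware Minimization (SAM) to jointly optimize the current-task loss and basin sharpness, and compare against the baselines: fine-tuning (FT), Elastic Weight Consolidation (EWC), and Experience Replay (ER).
When pre-training ResNet-18-PT, remove the overlapping ImageNet classes to control for overlap with the pre-training data.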
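The linear-interpolation analysis listed above can be illustrated with a minimal NumPy sketch. It is a toy example under stated assumptions, not the paper's code: the function names (`interpolate_loss`, `quadratic_loss`), the two-dimensional weights, and the quadratic stand-in losses are all hypothetical. It evaluates a task-1 loss along the straight line from the task-1 weights to the task-2 weights and contrasts a flat basin with a sharp one.

```python
import numpy as np

def interpolate_loss(w_start, w_end, loss_fn, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment w_start -> w_end."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1.0 - a) * w_start + a * w_end) for a in alphas])
    return alphas, losses

def quadratic_loss(center, curvature):
    """Toy stand-in for a task loss (assumption): a bowl centered at `center`."""
    return lambda w: 0.5 * curvature * float(np.sum((w - center) ** 2))

# Hypothetical weights after training on task 1 and then on task 2.
w_task1 = np.array([0.0, 0.0])
w_task2 = np.array([1.0, 1.0])

# Task-1 loss modeled as a flat basin vs. a sharp basin around w_task1.
flat_task1_loss = quadratic_loss(w_task1, curvature=0.1)
sharp_task1_loss = quadratic_loss(w_task1, curvature=10.0)

alphas, flat_curve = interpolate_loss(w_task1, w_task2, flat_task1_loss)
_, sharp_curve = interpolate_loss(w_task1, w_task2, sharp_task1_loss)

# The task-1 loss rises more slowly along the path when the basin is flat.
for a, f, s in zip(alphas, flat_curve, sharp_curve):
    print(f"alpha={a:.1f}  flat={f:.3f}  sharp={s:.3f}")
```

In this toy setting, a slower rise of the task-1 loss along the path corresponds to a wider basin, which is the property the review associates with less forgetting.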

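Experimental Results

Research Questions

RQ1: Does pre-training implicitly mitigate forgetting in lifelong learning across diverse tasks and domains?
RQ2: Do pre-trained models forget to a similar degree on homogeneous and diverse task sequences?
RQ3: How do different pre-trained initializations (model size, corpus diversity) affect forgetting?
RQ4: Can explicitly optimizing for flat minima reduce forgetting further, beyond the effect of pre-training?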

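Key Findings

Pre-trained initialization leads to markedly less forgetting than random initialization across multiple baselines and benchmarks.
The forgetting advantage of pre-training holds in both NLP and CV, but diverse task sequences remain challenging.
Greater model capacity and a more diverse pre-training corpus (e.g., RoBERTa-base, larger models) mitigate forgetting more effectively.
Pre-trained weights tend to lead sequential fine-tuning to wider (flatter) minima, as supported by the loss landscape analysis and sharpness metrics.
Explicitly optimizing for flat basins with SAM reduces forgetting and, in multiple settings, can outperform several state-of-the-art continual learning methods.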
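The SAM-based optimization mentioned in the findings above can be sketched as a two-step update: first perturb the weights toward the approximate worst case inside a small L2 ball, then descend using the gradient taken at the perturbed point. The NumPy code below is a minimal illustration under assumptions, not the paper's implementation; `sam_step`, the toy quadratic loss, and the values of `lr` and `rho` are all hypothetical choices made for the example.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update on weights w.

    1. Take the gradient at w and scale it to length rho; this is the
       first-order approximation of the worst-case perturbation.
    2. Take the gradient at the perturbed weights and use it to update w.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # small constant avoids division by zero
    g_perturbed = grad_fn(w + eps)
    return w - lr * g_perturbed

# Toy loss (assumption): L(w) = 0.5 * w^T A w, with one sharp and one flat direction.
A = np.diag([10.0, 0.1])

def loss(w):
    return 0.5 * float(w @ A @ w)

def grad(w):
    return A @ w

w_init = np.array([1.0, 1.0])
w_final = w_init.copy()
for _ in range(100):
    w_final = sam_step(w_final, grad)

print("initial loss:", loss(w_init))
print("final loss:  ", loss(w_final))
```

In a real training loop the two gradient evaluations would be two forward and backward passes over the same mini-batch, which is why SAM roughly doubles the per-step cost compared with plain gradient descent.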


This review was generated by AI and checked by a human editor.