QUICK REVIEW

[Paper Review] An Empirical Investigation of the Role of Pre-training in Lifelong Learning

Sanket Vaibhav Mehta, Darshan Patil|arXiv (Cornell University)|Dec 16, 2021

Domain Adaptation and Few-Shot Learning | Cited by 42

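One-Sentence Summary

This paper shows that generic pre-trained initialization implicitly mitigates catastrophic forgetting in sequential task learning, explains the effect through the flatness of the loss landscape, and proposes a sharpness-aware optimization method to reduce forgetting further.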

ABSTRACT

The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks.

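Research Motivation and Goals

Motivate lifelong learning as a more energy-efficient alternative to isolated training, and address catastrophic forgetting.
Systematically evaluate the effect of pre-training on forgetting across NLP and CV benchmarks with varying task diversity.
Analyze the loss landscape to understand why pre-training mitigates forgetting.
Propose and test an optimization objective that targets flat loss basins to reduce forgetting explicitly.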

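Proposed Method

Compare pre-trained and randomly initialized models on standard task-incremental lifelong learning benchmarks in CV and NLP.
Use the DistilBERT and ResNet-18 architectures, each with both pre-trained and random initialization.
Analyze the loss landscape and sharpness to assess the structure of the minima reached after sequential fine-tuning.
Compute sharpness metrics and linearly interpolate between the minima of sequential tasks to assess basin width.
Apply Sharpness-Aware Minimization (SAM) to jointly optimize the current task loss and basin sharpness, and compare against baselines (FT, EWC, ER).
Control for pre-training overlap by removing the overlapping ImageNet classes when pre-training ResNet-18-PT.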
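The SAM step named in the method list can be illustrated on a toy problem. The sketch below is not the authors' implementation. It follows the general SAM recipe: move to a nearby point in the direction of steepest ascent, within a radius rho, then update the original weights with the gradient measured there. The names sam_step, toy_loss, and toy_grad, the quadratic loss, and the values of lr and rho are all assumptions made for illustration.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update on a list of floats (illustrative sketch).

    grad_fn(w) must return the gradient of the loss at w as a list.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction exists, so leave w unchanged.
        return list(w)
    # Ascent step: perturb the weights toward higher loss, at distance rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point
    # to the original (unperturbed) weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Hypothetical toy loss: L(w) = 0.5 * sum(c_i * w_i^2), with one flat
# direction (curvature 1) and one sharp direction (curvature 10).
CURVATURE = (1.0, 10.0)


def toy_loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURE, w))


def toy_grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]


w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, toy_grad)
```

With a fixed rho the iterate settles in a small neighborhood of the minimum instead of converging to it exactly, which is expected for this kind of update. In a real setup the same two-gradient pattern would wrap a base optimizer over each mini-batch.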

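Experimental Results

Research Questions

RQ1: Does pre-training implicitly mitigate forgetting in lifelong learning across diverse tasks and domains?
RQ2: Do pre-trained models forget similarly on homogeneous and diverse task sequences?
RQ3: How do different pre-trained initializations (model size, corpus diversity) affect forgetting?
RQ4: Can explicitly optimizing for flat minima reduce forgetting beyond what pre-training alone achieves?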
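All four questions turn on how much a model forgets, which this summary does not define. A common convention in continual learning, assumed here and not confirmed by the summary, is to take each earlier task's best accuracy at any earlier point in training, subtract its final accuracy, and average the result over tasks. The function name average_forgetting and the accuracy matrix below are hypothetical.

```python
def average_forgetting(acc):
    """Average forgetting from an accuracy matrix (illustrative sketch).

    acc[t][i] is the accuracy on task i measured right after training
    on task t. Only entries with t >= i are read.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0  # Nothing can be forgotten with a single task.
    drops = []
    for i in range(num_tasks - 1):
        # Best accuracy on task i before the final task was learned.
        best_earlier = max(acc[t][i] for t in range(i, num_tasks - 1))
        drops.append(best_earlier - acc[num_tasks - 1][i])
    return sum(drops) / len(drops)


# Hypothetical accuracies for a three-task sequence.
acc = [
    [0.90, 0.00, 0.00],
    [0.70, 0.80, 0.00],
    [0.60, 0.75, 0.85],
]
forgetting = average_forgetting(acc)
```

Under this definition a lower value means less forgetting, and the last task never contributes because nothing is trained after it.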

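Key Findings

Across multiple baselines and benchmarks, pre-trained initialization leads to significantly less forgetting than random initialization.
The forgetting advantage of pre-training holds in both NLP and CV, although diverse task sequences remain challenging.
Greater model capacity and pre-training corpus diversity (e.g., RoBERTa-base, larger models) reduce forgetting more effectively.
Pre-trained weights tend to place sequential fine-tuning in wider (flatter) minima, as confirmed by loss landscape analysis and sharpness metrics.
Explicitly optimizing for flat basins (using SAM) reduces forgetting and, in several settings, outperforms a number of state-of-the-art continual learning methods.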
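The link between basin width and forgetting can be made concrete with a linear interpolation between two sets of weights. The sketch below evaluates the first task's loss along the straight line from the weights found after task 1 to the weights found after task 2. In a wide basin that loss rises slowly, so moving to the task 2 solution costs little on task 1. The quadratic losses, the weight vectors, and all function names are assumptions for illustration and do not reproduce the paper's experiments.

```python
def interpolate(w_a, w_b, alpha):
    """Point on the segment from w_a (alpha=0) to w_b (alpha=1)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]


def loss_along_path(loss_fn, w_a, w_b, num_points=11):
    """Evaluate loss_fn at evenly spaced points between w_a and w_b."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(w_a, w_b, alpha))) for alpha in alphas]


def make_quadratic_loss(center, curvature):
    """Hypothetical task loss: a quadratic bowl centered at its minimum."""
    def loss(w):
        return 0.5 * curvature * sum((wi - ci) ** 2 for wi, ci in zip(w, center))
    return loss


w_task1 = [0.0, 0.0]  # assumed minimum reached after task 1
w_task2 = [1.0, 1.0]  # assumed weights after fine-tuning on task 2

flat_task1_loss = make_quadratic_loss(w_task1, curvature=0.1)    # wide basin
sharp_task1_loss = make_quadratic_loss(w_task1, curvature=10.0)  # narrow basin

flat_path = loss_along_path(flat_task1_loss, w_task1, w_task2)
sharp_path = loss_along_path(sharp_task1_loss, w_task1, w_task2)
```

Comparing the last entries of flat_path and sharp_path shows the same displacement in weight space costing far more task 1 loss in the narrow basin than in the wide one.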


This review was generated by AI and reviewed by human editors.