[Paper Review] A comprehensive, application-oriented study of catastrophic forgetting in DNNs
This paper presents a large-scale empirical study of catastrophic forgetting (CF) in DNNs that learn sequentially, evaluated under application-oriented constraints across many datasets and sequential learning tasks (SLTs). It analyzes the effectiveness of several models, including EWC and IMM, concludes that CF persists under realistic conditions, and discusses practical workarounds.
We present a large-scale empirical study of catastrophic forgetting (CF) in modern Deep Neural Network (DNN) models that perform sequential (or: incremental) learning. A new experimental protocol is proposed that enforces typical constraints encountered in application scenarios. As the investigation is empirical, we evaluate CF behavior on the hitherto largest number of visual classification datasets, from each of which we construct a representative number of Sequential Learning Tasks (SLTs) in close alignment to previous works on CF. Our results clearly indicate that there is no model that avoids CF for all investigated datasets and SLTs under application conditions. We conclude with a discussion of potential solutions and workarounds to CF, notably for the EWC and IMM models.
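The two SLT types named in the key findings can be made concrete with a short sketch. It assumes they follow the usual conventions of earlier CF work. In a class-split SLT such as D5-5, the classes are divided into disjoint groups, for example classes 0-4 for D1 and 5-9 for D2. In DP10-10, both sub-tasks contain all ten classes, and D2 is the same data with one fixed random pixel permutation applied. The helper names and the toy data layout (flattened pixel lists) are hypothetical and do not come from the authors' code.

```python
import random
from typing import List, Set, Tuple

Sample = Tuple[List[float], int]  # (flattened pixels, class label); hypothetical layout


def class_split_slt(data: List[Sample], d1_classes: Set[int]) -> Tuple[List[Sample], List[Sample]]:
    """Class-split SLT (e.g. D5-5): D1 holds the classes in d1_classes, D2 holds the rest."""
    d1 = [s for s in data if s[1] in d1_classes]
    d2 = [s for s in data if s[1] not in d1_classes]
    return d1, d2


def permutation_slt(data: List[Sample], seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Permutation SLT (e.g. DP10-10): D1 is the original data, D2 the same data with permuted pixels."""
    n_pixels = len(data[0][0])
    perm = list(range(n_pixels))
    random.Random(seed).shuffle(perm)  # a single permutation shared by every D2 sample
    d2 = [([x[i] for i in perm], y) for x, y in data]
    return list(data), d2


# Toy dataset: 10 classes, 2 samples per class, 4 "pixels" per sample.
toy = [([float(c), float(k), 0.0, 1.0], c) for c in range(10) for k in range(2)]

d1, d2 = class_split_slt(toy, d1_classes=set(range(5)))  # D5-5-style split
print(len(d1), len(d2))  # 10 10

p1, p2 = permutation_slt(toy)  # DP10-10-style pair
print(sorted(p2[0][0]) == sorted(p1[0][0]))  # True: same pixel values, reordered
```

Seeding a private `random.Random` instance keeps the permutation reproducible without touching global random state.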
Motivation & Objective
- Motivate and formalize an application-oriented evaluation protocol for sequential learning in DNNs that reflects real-world constraints (memory, causality, and update complexity).
- Systematically assess CF across a large, diverse set of visual classification datasets and sequential learning tasks to determine whether any current model avoids CF under those constraints.
- Compare several DNN approaches (Dropout variants, LWTA, EWC, IMM) in terms of their CF behavior and practical feasibility under the proposed protocol.
- Highlight limitations of common CF benchmarks (e.g., permutation-based SLTs) and provide guidance on interpretation and model selection under applied conditions.
Proposed method
- Define Sequential Learning Tasks (SLTs), each consisting of two sub-tasks D1 and D2, and construct several SLTs from each of many visual classification datasets under consistent protocol constraints.
- Implement and evaluate, in a TensorFlow-based framework, fully connected (FC) and convolutional (CONV) networks, their Dropout variants (D-FC, D-CONV), LWTA, EWC, and IMM.
- Perform a combinatorial hyper-parameter search for each model on the first sub-task D1 to select a candidate configuration, then continue training that model on D2 with varying learning rates to measure CF.
- Quantify incremental learning quality q as performance on the union task D1∪D2, read out either as the best value observed during training on D2 or as the last value at its end, and compare it against a baseline trained jointly on D1∪D2, where no CF can occur.
- For IMM, evaluate its weight-transfer and model-merging schemes and analyze sensitivity to the balancing parameter α, including whether tuning α is feasible in practice.
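The α-sensitivity analysis for IMM is easier to follow with the two merging rules of incremental moment matching in view. Mean-IMM forms a convex combination of the two solutions. Mode-IMM weights each parameter by its (diagonal) Fisher information. The sketch below rests on several assumptions: there are two sub-tasks, weights are flattened vectors, α weights the model trained on D2, and mode-IMM uses a diagonal Fisher. The weight-transfer schemes that IMM pairs with merging are not shown, and the function names are hypothetical.

```python
from typing import List

Weights = List[float]  # flattened parameter vector (hypothetical layout)


def mean_imm(theta1: Weights, theta2: Weights, alpha: float) -> Weights:
    """Mean-IMM: (1 - alpha) * theta1 + alpha * theta2, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta1, theta2)]


def mode_imm(theta1: Weights, theta2: Weights, fisher1: Weights, fisher2: Weights,
             alpha: float, eps: float = 1e-8) -> Weights:
    """Mode-IMM: per-parameter average weighted by alpha-scaled diagonal Fisher values."""
    merged = []
    for a, b, f1, f2 in zip(theta1, theta2, fisher1, fisher2):
        w1, w2 = (1.0 - alpha) * f1, alpha * f2
        merged.append((w1 * a + w2 * b) / (w1 + w2 + eps))  # eps guards against zero Fisher
    return merged


theta1, theta2 = [0.0, 2.0], [2.0, 4.0]
print(mean_imm(theta1, theta2, alpha=0.5))  # [1.0, 3.0]

# A parameter with large Fisher information for D1 stays close to its D1 value.
print(mode_imm(theta1, theta2, fisher1=[100.0, 1.0], fisher2=[1.0, 1.0], alpha=0.5))

# Sweeping alpha: every candidate is a different merged network that must be evaluated.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, mean_imm(theta1, theta2, alpha))
```

Each candidate α produces a different merged network, so choosing α means evaluating every candidate, presumably on data that covers both sub-tasks. This is one way to see why tuning α sits uneasily with the application constraints described in the key findings.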
Experimental results
Research questions
- RQ1: Does catastrophic forgetting persist across a broad set of datasets and SLTs under application-oriented constraints?
- RQ2: Which DNN approaches (Dropout variants, EWC, IMM, LWTA) mitigate CF most effectively under realistic resource and causality constraints?
- RQ3: How do evaluation criteria that respect application constraints (best vs. last) influence conclusions about CF and model effectiveness?
- RQ4: Are permutation-based SLTs reliable benchmarks for CF assessment in applied scenarios?
- RQ5: What practical workarounds could make methods like EWC or IMM viable in real-world sequential learning settings?
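RQ3 depends on how q is read out. The sketch below assumes that accuracy on the D1∪D2 test set is logged periodically while the model trains on D2. "best" takes the peak of that trace, which can only be located in hindsight and presumably requires D1 data during D2 training. "last" takes the final value, which corresponds to the model an application would actually deploy. The function name and the accuracy trace are hypothetical illustrations, not results from the paper.

```python
from typing import Dict, List


def incremental_quality(trace: List[float]) -> Dict[str, float]:
    """Read out q from D1 ∪ D2 test accuracies logged during training on D2."""
    if not trace:
        raise ValueError("empty accuracy trace")
    return {
        "best": max(trace),  # peak value, only identifiable in hindsight
        "last": trace[-1],   # value at the end of D2 training
    }


# Hypothetical trace: D2 is learned quickly, then D1 is gradually forgotten.
trace = [0.52, 0.78, 0.81, 0.66, 0.50]
q = incremental_quality(trace)
print(q)  # {'best': 0.81, 'last': 0.5}
```

On such a trace the two readouts give very different verdicts for the same training run, which is why the choice of criterion matters when judging CF.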
Key findings
- All tested models exhibit catastrophic forgetting under the proposed application-oriented protocol across the studied datasets and SLTs.
- EWC provides mild protection against CF for simple SLTs but is ineffective for more complex SLTs with comparable sub-task sizes (e.g., D5-5 type tasks).
- IMM generally achieves the best incremental learning performance of all tested models, but its high computational cost and its need for parameter tuning that conflicts with application constraints often make it infeasible in practice.
- Permutation-based SLTs (DP10-10) do not exhibit CF for any model, suggesting such SLTs should be treated with caution as benchmarks for CF.
- Model selection (hyper-parameter tuning) must be integral to training on SLTs under application scenarios, as poor choices on D1 can severely degrade performance on subsequent tasks.
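The EWC finding is easier to interpret with the regularizer in view. The sketch below shows the standard EWC objective: the D2 loss plus a quadratic penalty pulling each parameter toward its D1 value, weighted by that parameter's diagonal Fisher information. It assumes the Fisher is estimated as the mean of squared per-sample log-likelihood gradients on D1, and it treats the strength `lam` and the helper names as hypothetical. It is plain Python for illustration, not the TensorFlow implementation used in the study.

```python
from typing import List

Vector = List[float]


def diagonal_fisher(per_sample_grads: List[Vector]) -> Vector:
    """Empirical diagonal Fisher: mean of squared per-sample log-likelihood gradients on D1."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(theta: Vector, theta_d1: Vector, fisher: Vector, lam: float) -> float:
    """(lam / 2) * sum_i F_i * (theta_i - theta_d1_i)^2."""
    return 0.5 * lam * sum(f * (t - t1) ** 2 for t, t1, f in zip(theta, theta_d1, fisher))


def ewc_loss(task2_loss: float, theta: Vector, theta_d1: Vector, fisher: Vector, lam: float) -> float:
    """Total objective minimized while training on D2."""
    return task2_loss + ewc_penalty(theta, theta_d1, fisher, lam)


fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])  # [5.0, 0.0]
# Moving the first parameter is penalized; the second (zero Fisher) moves freely.
print(ewc_penalty([1.0, 5.0], [0.0, 0.0], fisher, lam=2.0))  # 5.0
```

Only parameters that the Fisher marks as important for D1 are protected. The strength `lam` is one more hyper-parameter that, like the others, would have to be chosen under the application constraints.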
This review was created by AI and reviewed by human editors.