QUICK REVIEW

[Paper Review] Scalable and Order-robust Continual Learning with Additive Parameter Decomposition

Jaehong Yoon, Saehoon Kim|arXiv (Cornell University)|Feb 25, 2019

Domain Adaptation and Few-Shot Learning | Cited by 33

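One-sentence summary

Introduces Additive Parameter Decomposition (APD) for continual learning, which splits parameters into a shared part and sparse task-adaptive parts, and uses retroactive updates and hierarchical consolidation to achieve scalability and order-robustness.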

ABSTRACT

While recent continual learning methods largely alleviate the catastrophic problem on toy-sized datasets, some issues remain to be tackled to apply them to real-world problem domains. First, a continual learning model should effectively handle catastrophic forgetting and be efficient to train even with a large number of tasks. Secondly, it needs to tackle the problem of order-sensitivity, where the performance of the tasks largely varies based on the order of the task arrival sequence, as it may cause serious problems where fairness plays a critical role (e.g. medical diagnosis). To tackle these practical challenges, we propose a novel continual learning method that is scalable as well as order-robust, which instead of learning a completely shared set of weights, represents the parameters for each task as a sum of task-shared and sparse task-adaptive parameters. With our Additive Parameter Decomposition (APD), the task-adaptive parameters for earlier tasks remain mostly unaffected, where we update them only to reflect the changes made to the task-shared parameters. This decomposition of parameters effectively prevents catastrophic forgetting and order-sensitivity, while being computation- and memory-efficient. Further, we can achieve even better scalability with APD using hierarchical knowledge consolidation, which clusters the task-adaptive parameters to obtain hierarchically shared parameters. We validate our network with APD, APD-Net, on multiple benchmark datasets against state-of-the-art continual learning methods, which it largely outperforms in accuracy, scalability, and order-robustness.

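Research motivation and goals

Address catastrophic forgetting while scaling to a large number of tasks.
Reduce sensitivity to the order of the task sequence, ensuring fair and stable performance.
Provide a continual learning framework that is efficient in memory and computation.
Introduce retroactive updates and hierarchical knowledge consolidation to improve robustness and scalability.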

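Proposed method

Decompose the network parameters into task-shared parameters sigma and sparse task-adaptive parameters tau_t, with a mask M_t that controls how each task uses the shared parameters.
Jointly optimize sigma, tau_t, and the mask parameters with regularization: minimize L(...) + lambda1 ||tau_t||_1 + lambda2 ||sigma - sigma^(t-1)||^2.
Apply retroactive updates: at task t, reconstruct each earlier theta_i from the updated sigma and M_i, then update tau_i so that the result stays as close as possible to the past solution (Eq. (2)).
Perform hierarchical knowledge consolidation by clustering the task-adaptive parameters and separating their shared part from the local components, which limits capacity growth (Eq. (3)).
Support selective forgetting by discarding tau_t for a completed task, without affecting other tasks.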
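
The decomposition, the regularizer, the retroactive update, and selective forgetting described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`compose`, `apd_penalty`, `retroactive_tau`) are hypothetical, the mask is assumed to be applied element-wise with broadcasting, and the task loss L is left out entirely. In particular, `retroactive_tau` minimizes only an L1 term plus a squared reconstruction term, whose closed-form solution is a soft-threshold; the summary above does not confirm that the paper solves the update this way.

```python
import numpy as np


def compose(sigma, mask, tau):
    """Task weights: masked shared weights plus a task-adaptive residual."""
    return sigma * mask + tau


def apd_penalty(tau, sigma, sigma_prev, lam1, lam2):
    """Regularizer only: lam1 * ||tau||_1 + lam2 * ||sigma - sigma_prev||^2."""
    return lam1 * np.abs(tau).sum() + lam2 * np.sum((sigma - sigma_prev) ** 2)


def retroactive_tau(theta_star, sigma_new, mask, lam1=0.0, lam2=1.0):
    """Minimize lam1*||tau||_1 + lam2*||theta_star - (sigma_new*mask + tau)||^2.

    The minimizer is a soft-threshold of the residual. The task loss is
    ignored here, which is a simplifying assumption of this sketch.
    """
    residual = theta_star - sigma_new * mask
    threshold = lam1 / (2.0 * lam2)
    return np.sign(residual) * np.maximum(np.abs(residual) - threshold, 0.0)


rng = np.random.default_rng(0)
sigma_prev = rng.normal(size=(4, 3))                      # shared weights
masks = {1: rng.uniform(size=3), 2: rng.uniform(size=3)}  # one mask per task
taus = {1: np.zeros((4, 3)), 2: np.zeros((4, 3))}         # sparse residuals
taus[1][0, 0] = 0.5
taus[2][1, 2] = -0.3

# Solutions reached for tasks 1 and 2 before the shared weights move.
theta_star = {t: compose(sigma_prev, masks[t], taus[t]) for t in masks}

# Learning a new task nudges sigma; re-fit each earlier tau to compensate.
sigma_new = sigma_prev + 0.01 * rng.normal(size=(4, 3))
taus = {t: retroactive_tau(theta_star[t], sigma_new, masks[t]) for t in masks}
drift = max(
    np.abs(compose(sigma_new, masks[t], taus[t]) - theta_star[t]).max()
    for t in masks
)
print("max drift after retroactive update:", drift)

# Selective forgetting: drop task 1's own parameters; task 2 is unaffected.
theta2_before = compose(sigma_new, masks[2], taus[2])
del taus[1], masks[1]
theta2_after = compose(sigma_new, masks[2], taus[2])
print("task 2 unchanged:", np.array_equal(theta2_before, theta2_after))
```

With `lam1=0` the re-fitted tau absorbs the whole change in sigma, so earlier task weights are preserved up to floating-point error but tau is no longer sparse. A positive `lam1` trades some drift for sparsity, which is the tension the L1 term in the objective is meant to manage.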

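Experimental results

Research questions

RQ1: How can continual learning scale to a large number of tasks without a significant increase in capacity?
RQ2: Can order-sensitivity be reduced so that the order of the task sequence has minimal effect on final performance?
RQ3: Does decomposing parameters into shared and sparse task-adaptive components effectively prevent catastrophic forgetting?
RQ4: Can hierarchical consolidation further improve efficiency by sharing knowledge among related tasks?
RQ5: Can selective forgetting be achieved without harming non-target tasks?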

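Key findings

APD-Net outperforms state-of-the-art baselines in accuracy while using far less capacity than expansion-based methods.
Retroactive updates to the task-adaptive parameters of past tasks reduce semantic drift and improve order-robustness.
Hierarchical knowledge consolidation further reduces capacity growth and strengthens transfer among related tasks.
APD-Net scales well to a large number of tasks (e.g., Omniglot-rotation, 100 tasks), with logarithmic growth in parameters.
Selective forgetting can remove a task's parameters without degrading performance on the remaining tasks.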
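
To see why consolidation can lower capacity, the toy sketch below splits one group of task-adaptive tensors into a single shared tensor plus per-task local residuals and counts stored nonzero entries before and after. Everything here is an assumption made for illustration: the group is taken as given (the clustering step is omitted), the helper names `consolidate` and `nonzeros` are hypothetical, and the agreement rule (treat an entry as shared when its spread across the group is at most `beta`) is a stand-in rather than a confirmed reproduction of Eq. (3). The numbers are made up and are not results from the paper.

```python
import numpy as np


def consolidate(group_taus, beta):
    """Split a group of task-adaptive tensors into a shared part and local parts.

    Entries whose spread (max - min) across the group is at most beta, and
    which are not all zero, are moved into the shared tensor as their mean.
    This agreement rule is an assumption of the sketch.
    """
    stacked = np.stack(group_taus)                       # shape: (tasks, ...)
    spread = stacked.max(axis=0) - stacked.min(axis=0)
    agree = (spread <= beta) & (np.abs(stacked).max(axis=0) > 0)
    shared = np.where(agree, stacked.mean(axis=0), 0.0)
    local = [np.where(agree, 0.0, tau) for tau in group_taus]
    return shared, local


def nonzeros(arrays):
    """Total number of nonzero entries that would need to be stored."""
    return int(sum(np.count_nonzero(a) for a in arrays))


# Three made-up task-adaptive vectors that agree on their first entry.
tau_a = np.array([0.50, 0.00, 0.20, 0.00])
tau_b = np.array([0.52, 0.00, -0.40, 0.00])
tau_c = np.array([0.48, 0.00, 0.00, 0.30])

shared, local = consolidate([tau_a, tau_b, tau_c], beta=0.05)
before = nonzeros([tau_a, tau_b, tau_c])
after = nonzeros([shared] + local)
print("nonzeros before:", before, "after:", after)
```

In this toy case the three vectors store 6 nonzero entries before consolidation and 4 after, because the first entry is kept once in the shared tensor instead of three times. Each task's parameters are then recovered as `shared + local[i]`, accurate to within `beta` on the consolidated entries.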


This review was generated by AI and reviewed by a human editor.