QUICK REVIEW

[Paper Review] Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting

Hippolyt Ritter, Aleksandar Botev|arXiv (Cornell University)|May 20, 2018

Domain Adaptation and Few-Shot Learning | Cited by 98

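One-Sentence Summary

This paper proposes a Kronecker-factored online Laplace approximation to mitigate catastrophic forgetting in neural networks. It updates a Gaussian posterior online, using a block-diagonal Kronecker-factored approximation to the Hessian. The method achieves over 90% test accuracy on a sequence of 50 permuted MNIST tasks, outperforming several baselines.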

ABSTRACT

We introduce the Kronecker factored online Laplace approximation for overcoming catastrophic forgetting in neural networks. The method is grounded in a Bayesian online learning framework, where we recursively approximate the posterior after every task with a Gaussian, leading to a quadratic penalty on changes to the weights. The Laplace approximation requires calculating the Hessian around a mode, which is typically intractable for modern architectures. In order to make our method scalable, we leverage recent block-diagonal Kronecker factored approximations to the curvature. Our algorithm achieves over 90% test accuracy across a sequence of 50 instantiations of the permuted MNIST dataset, substantially outperforming related methods for overcoming catastrophic forgetting.
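
The quadratic penalty mentioned in the abstract follows directly from the Gaussian form of the approximate posterior. As a short sketch, writing the Gaussian's mean as \mu_t and its precision as \Lambda_t (the notation used in the method section of this review), the log-density is quadratic in the weights, so the mode for the next task is a penalized maximum-likelihood estimate:

```latex
\log q(\theta \mid \phi_t)
  = -\tfrac{1}{2} (\theta - \mu_t)^\top \Lambda_t (\theta - \mu_t) + \text{const}
\quad\Longrightarrow\quad
\mu_{t+1} = \arg\max_{\theta} \;
  \log p(D_{t+1} \mid \theta)
  - \tfrac{1}{2} (\theta - \mu_t)^\top \Lambda_t (\theta - \mu_t)
```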

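Research Motivation and Objectives

Develop a Bayesian online learning framework for mitigating catastrophic forgetting in neural networks.
Propose and implement a Kronecker-factored Laplace approximation that tracks the posterior distribution across tasks.
Exploit block-diagonal Hessian structure so that the method scales to modern architectures.
Study a hyperparameter that regularizes the curvature to balance memory against plasticity.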

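Proposed Method

Formulate Bayesian online learning with an approximate Gaussian posterior q(θ|φt), parameterized by a mean μt and a precision matrix Λt.
Use a two-step update: (i) obtain μt+1 by maximizing log p(Dt+1|θ) + log q(θ|φt); (ii) set Λt+1 = Ht+1(μt+1) + Λt, where Ht+1 is the Hessian of the negative log-likelihood of the new data.
Approximate the Hessian with a positive semi-definite matrix based on the Fisher information, which ensures that Λt is PSD.
Use a block-diagonal Kronecker-factored Hessian, so that the curvature of each layer l is Hl = Ql ⊗ ℋl (Ql is the covariance of the layer's inputs and ℋl is the pre-activation Hessian), which allows the quadratic penalty in vec(Wl − Wl*) to be evaluated efficiently.
Represent the posterior over each layer's weights as a matrix normal distribution, keeping curvature interactions within a layer but not across layers.
Introduce a multiplier λ on the Hessian as a regularizer that controls the width of the approximate posterior: Λt+1 = λ Ht+1(μt+1) + Λt.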
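
The efficiency gain from the Kronecker structure can be illustrated with a small NumPy sketch. It assumes a layer curvature of the form Q ⊗ H, with an input-side factor Q and an output-side factor H, and relies on the standard identity (Q ⊗ H) vec(D) = vec(H D Qᵀ) for column-stacking vec. The function names, the toy layer sizes, and the random factors are all hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def kf_penalty(W, W_star, Q, H):
    """Evaluate vec(D)^T (Q kron H) vec(D) for D = W - W_star without forming Q kron H."""
    D = W - W_star                       # shape (n_out, n_in)
    # (Q kron H) vec(D) = vec(H D Q^T) for column-stacking vec, so the
    # quadratic form is the sum of elementwise products of D and H D Q^T.
    return float(np.sum(D * (H @ D @ Q.T)))

def kf_penalty_explicit(W, W_star, Q, H):
    """Same quantity via the full Kronecker product (only viable for tiny layers)."""
    d = (W - W_star).flatten(order="F")  # column-stacking vec
    return float(d @ np.kron(Q, H) @ d)

# Hypothetical toy layer: 4 inputs, 3 outputs, random PSD factors.
rng = np.random.default_rng(0)
n_in, n_out = 4, 3
A = rng.normal(size=(n_in, 10))
B = rng.normal(size=(n_out, 10))
Q = A @ A.T / 10                         # stand-in for the input-side factor
H = B @ B.T / 10                         # stand-in for the output-side factor
W_star = rng.normal(size=(n_out, n_in))  # weights at the previous mode
W = W_star + 0.1 * rng.normal(size=(n_out, n_in))

fast = kf_penalty(W, W_star, Q, H)
slow = kf_penalty_explicit(W, W_star, Q, H)
print(np.isclose(fast, slow))
```

The factored version only multiplies matrices of size n_out × n_out and n_in × n_in, whereas the explicit version builds an (n_in·n_out) × (n_in·n_out) matrix, which quickly becomes impractical for realistic layer widths.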

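Experimental Results

Research Questions

RQ1: Can an online Laplace approximation with Kronecker-factored curvature effectively mitigate forgetting in neural networks over long task sequences?
RQ2: Does including within-layer parameter interactions (Kronecker factorization) outperform a diagonal approximation for continual learning?
RQ3: How does the regularization hyperparameter λ affect memory and plasticity in online continual learning?
RQ4: How does the proposed method scale relative to EWC and SI on vision and MNIST-based continual learning benchmarks?
RQ5: Is it necessary to recompute the curvature for every task, or does retaining the existing curvature suffice without harming performance?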

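Key Findings

The Kronecker-factored online Laplace approximation achieves an average test accuracy above 90% across 50 permuted MNIST tasks, close to joint-training performance.
Kronecker-factored curvature consistently outperforms diagonal curvature at remembering old tasks while retaining the capacity to learn new ones.
Introducing λ helps regulate the posterior width; on permuted MNIST, λ≈3 gave the best balance between remembering old tasks and learning new ones.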
Diagonal (EWC-like) approximations underperform compared to Kronecker-factored approaches, highlighting the importance of weight interactions within layers.
Regularization remains beneficial even with Kronecker factorization, suggesting potential gains from better curvature approximations.
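
How λ changes the posterior width can be seen in a toy setting. The sketch below assumes a linear-Gaussian regression model, where the Hessian of the negative log-likelihood is exactly XᵀX/σ² and the mean update has a closed form. It is not the paper's neural-network setup, and the function names, the unit prior, and the synthetic data are hypothetical.

```python
import numpy as np

def online_laplace_step(mu, Lam, X, y, lam=1.0, noise_var=1.0):
    """One task update for a linear-Gaussian model (toy stand-in for a network)."""
    H = X.T @ X / noise_var              # Hessian of the negative log-likelihood
    # Step (i): maximise log p(D|theta) + log q(theta|phi_t); closed form here.
    mu_new = np.linalg.solve(H + Lam, X.T @ y / noise_var + Lam @ mu)
    # Step (ii): Lambda_{t+1} = lam * H_{t+1} + Lambda_t
    Lam_new = lam * H + Lam
    return mu_new, Lam_new

def run_sequence(tasks, dim, lam=1.0):
    """Process tasks one at a time, starting from a unit Gaussian prior (assumed)."""
    mu, Lam = np.zeros(dim), np.eye(dim)
    for X, y in tasks:
        mu, Lam = online_laplace_step(mu, Lam, X, y, lam=lam)
    return mu, Lam

# Hypothetical synthetic task sequence: 5 tasks, 20 points each.
rng = np.random.default_rng(0)
dim = 3
theta_true = rng.normal(size=dim)
tasks = []
for _ in range(5):
    X = rng.normal(size=(20, dim))
    y = X @ theta_true + rng.normal(size=20)
    tasks.append((X, y))

mu_1, Lam_1 = run_sequence(tasks, dim, lam=1.0)
mu_3, Lam_3 = run_sequence(tasks, dim, lam=3.0)
print(np.trace(Lam_1), np.trace(Lam_3))
```

In this toy model, λ = 1 reproduces the exact Bayesian posterior that batch training on all tasks would give, while λ > 1 accumulates a larger precision (a narrower posterior) and therefore penalizes movement away from earlier modes more strongly.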

This review was generated by AI and checked by a human editor.