[Paper Explained] Deep learning generalizes because the parameter-function map is biased towards simple functions
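This paper argues that the parameter-function map of deep neural networks is exponentially biased towards simple functions, and that this bias acts as an intrinsic regularizer that promotes good generalization. It connects the bias to generalization performance using algorithmic information theory and a PAC-Bayes bound estimated with Gaussian processes.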
Deep neural networks (DNNs) generalize remarkably well without explicit regularization even in the strongly over-parametrized regime where classical learning theory would instead predict that they would severely overfit. While many proposals for some kind of implicit regularization have been made to rationalise this success, there is no consensus for the fundamental reason why DNNs do not strongly overfit. In this paper, we provide a new explanation. By applying a very general probability-complexity bound recently derived from algorithmic information theory (AIT), we argue that the parameter-function map of many DNNs should be exponentially biased towards simple functions. We then provide clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully connected and convolutional networks applied to CIFAR10 and MNIST. As the target functions in many real problems are expected to be highly structured, this intrinsic simplicity bias helps explain why deep networks generalize well on real world problems. This picture also facilitates a novel PAC-Bayes approach where the prior is taken over the DNN input-output function space, rather than the more conventional prior over parameter space. If we assume that the training algorithm samples parameters close to uniformly within the zero-error region then the PAC-Bayes theorem can be used to guarantee good expected generalization for target functions producing high-likelihood training sets. By exploiting recently discovered connections between DNNs and Gaussian processes to estimate the marginal likelihood, we produce relatively tight generalization PAC-Bayes error bounds which correlate well with the true error on realistic datasets such as MNIST and CIFAR10 and for architectures including convolutional and fully connected networks.
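Research Motivation and Goals
- Use a bound from algorithmic information theory to argue that the parameter-function map of DNNs is biased towards simple functions.
- Demonstrate the simplicity bias empirically, in a small DNN on Boolean functions and in larger architectures (CNNs, FCNs) on MNIST and CIFAR-10.
- Introduce a PAC-Bayes framework whose prior is placed over input-output functions and estimated with Gaussian processes, in order to bound generalization.
- Show that GP estimates of the marginal likelihood approximate NN behaviour well and yield useful generalization bounds across several architectures and datasets.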
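Proposed Method
- Define the parameter-function map M: Θ -> F for a neural model and analyze its simplicity bias.
- Apply the probability-complexity bound from algorithmic information theory, which relates the probability of a function to its descriptional complexity K(f).
- Estimate P(f) empirically for the discrete Boolean-function DNN by sampling parameters and counting how often each function appears.
- Use a Gaussian process (GP) approximation to estimate the function prior P(f), and compute the marginal likelihood P(U) of the training data.
- Apply the PAC-Bayes bound with the GP-based prior to obtain bounds on the expected generalization error that track the actual error across datasets.
- Compare the GP-based marginal likelihood with empirically sampled NN probabilities to validate the GP approximation.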
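The sampling step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the one-hidden-layer ReLU architecture, the Gaussian initialization scales, the sample count, and the helper names (`all_boolean_inputs`, `sample_function`, `lz_phrase_count`, `estimate_function_probabilities`) are all assumptions chosen for brevity, and the LZ78-style phrase count is only a crude stand-in for the Lempel-Ziv complexity measure the paper uses.

```python
import numpy as np
from collections import Counter


def all_boolean_inputs(n):
    """Return all 2**n binary input vectors as a (2**n, n) float array."""
    idx = np.arange(2 ** n)
    return ((idx[:, None] >> np.arange(n)[::-1]) & 1).astype(float)


def sample_function(X, hidden, rng):
    """Draw random weights for a one-hidden-layer ReLU net (assumed
    architecture) and return the Boolean function it computes, encoded
    as a bit string with one output bit per input vector."""
    n = X.shape[1]
    W1 = rng.normal(0, 1 / np.sqrt(n), size=(n, hidden))
    b1 = rng.normal(0, 1, size=hidden)
    W2 = rng.normal(0, 1 / np.sqrt(hidden), size=hidden)
    b2 = rng.normal(0, 1)
    h = np.maximum(X @ W1 + b1, 0)
    out = h @ W2 + b2
    return "".join("1" if v > 0 else "0" for v in out)


def lz_phrase_count(s):
    """LZ78-style phrase count: a rough proxy for Lempel-Ziv complexity."""
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    # A trailing, already-seen phrase still counts as one more phrase.
    return len(phrases) + (1 if current else 0)


def estimate_function_probabilities(n=5, hidden=20, samples=5000, seed=0):
    """Estimate P(f) by counting how often each function is sampled."""
    rng = np.random.default_rng(seed)
    X = all_boolean_inputs(n)
    counts = Counter(sample_function(X, hidden, rng) for _ in range(samples))
    return {f: c / samples for f, c in counts.items()}


if __name__ == "__main__":
    probs = estimate_function_probabilities()
    top = sorted(probs.items(), key=lambda kv: -kv[1])[:5]
    for f, p in top:
        print(f, f"P(f)~{p:.4f}", "LZ phrases:", lz_phrase_count(f))
```

Plotting the estimated P(f) against the complexity proxy for every sampled function is the natural next step. If the simplicity bias holds in this toy setting, the most frequently sampled bit strings should be the ones with the fewest phrases.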
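Experimental Results
Research Questions
- RQ1: Is the parameter-function map of DNNs strongly biased towards simple functions?
- RQ2: Can the generalization observed in over-parametrized networks be explained by algorithmic information theory together with a PAC-Bayes bound that uses a function-space prior (estimated with Gaussian processes)?
- RQ3: Under random parameter sampling, do empirical measures such as Lempel-Ziv complexity correlate with function probability?
- RQ4: Is the GP-based prior close enough to the NN marginal likelihood to yield meaningful generalization bounds on real datasets?
- RQ5: Does SGD-like training sample approximately uniformly within the zero-error region, as the PAC-Bayes framework assumes?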
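Key Findings
- The parameter-function map of DNNs is exponentially biased towards low-complexity (simple) functions, which makes the distribution P(f) highly skewed.
- Empirical studies on the Boolean-function DNN and on larger architectures (CNNs and FCNs) show that high-probability functions have low Lempel-Ziv complexity and score low on other Kolmogorov-like complexity measures.
- The Gaussian process approximation closely reproduces the marginal likelihood of finite-width networks, which makes it practical to estimate the P(U) needed for the PAC-Bayes bound.
- PAC-Bayes bounds computed with the GP-approximated prior track the true generalization error on MNIST, fashion-MNIST, and CIFAR-10, for both CNN and FC architectures.
- SGD-like training and GP-based Bayesian sampling produce similar distributions over functions, which supports the interpretation that optimization is biased towards simple, high-probability functions.
- The proposed function-space PAC-Bayes bound is relatively tight and agrees with the generalization trends observed across datasets.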
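To make the role of the marginal likelihood P(U) concrete, the sketch below turns a log marginal likelihood into an error bound. It assumes the realizable PAC-Bayes form -ln(1 - ε) ≤ (ln(1/P(U)) + ln(2m/δ)) / (m - 1), where m is the training-set size and δ the confidence parameter. That exact form is an assumption here and should be checked against the paper's theorem. The function name `pac_bayes_error_bound` and the numbers in the usage lines are hypothetical.

```python
import math


def pac_bayes_error_bound(log_marginal_likelihood, m, delta=0.05):
    """Upper bound on the expected generalization error, assuming
    -ln(1 - eps) <= (ln(1/P(U)) + ln(2m/delta)) / (m - 1).

    log_marginal_likelihood: natural log of P(U), so it must be <= 0.
    m: number of training examples (at least 2).
    delta: the bound is assumed to hold with probability >= 1 - delta.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    if log_marginal_likelihood > 0:
        raise ValueError("log P(U) cannot be positive")
    rhs = (-log_marginal_likelihood + math.log(2 * m / delta)) / (m - 1)
    # Invert -ln(1 - eps) <= rhs to get eps <= 1 - exp(-rhs).
    return 1.0 - math.exp(-rhs)


if __name__ == "__main__":
    # Hypothetical values: a training set of 10,000 examples and two
    # made-up log marginal likelihoods, one much higher than the other.
    for log_p_u in (-500.0, -5000.0):
        print(log_p_u, pac_bayes_error_bound(log_p_u, m=10_000))
```

Under this form, a training set with a higher marginal likelihood under the function-space prior gives a smaller bound, which is how a prior biased towards simple functions translates into a tighter guarantee for structured targets.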
This explainer was generated by AI and reviewed by a human editor.