QUICK REVIEW

[Paper Review] Predicting the Generalization Gap in Deep Networks with Margin Distributions

Yiding Jiang, Dilip Krishnan|arXiv (Cornell University)|Sep 28, 2018

Adversarial Robustness in Machine Learning | References: 35 | Cited by: 86

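One-Sentence Summary

This paper proposes a measure, based on margin distributions at multiple layers, for predicting the generalization gap of deep networks; on CIFAR-10/100 it correlates strongly with the observed generalization gap and outperforms several existing bounds. The method concatenates normalized margin statistics across layers and feeds them to a simple linear predictor.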

ABSTRACT

As shown in recent research, deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how generalization gap should be predicted from the training data and network parameters. In this paper, we propose such a measure, and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the concept of margin distribution, which are the distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and the CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be of importance: normalizing margin values for scale independence, using characterizations of margin distribution rather than just the margin (closest distance to decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.

Motivation and Objectives

Motivate and quantify the generalization gap in deep networks whose training loss is small, a regime where the training loss and traditional bounds fail to predict generalization.
Develop a margin-distribution based measure that captures information across multiple layers to predict generalization gap.
Normalize and summarize layer-wise margin distributions to form a compact feature set for regression.
Demonstrate the predictive power of the proposed measure across architectures (CNNs and ResNets) and datasets (CIFAR-10/100).
Propose that margin-based measures could inspire new loss functions or training techniques for better generalization.

Proposed Method

Define layer-wise margin distances using a first-order Taylor approximation to distance to the decision boundary (Eq. 3).
Normalize margins by the square root of the total variation (empirical covariance trace) of layer activations (Eq. 5).
Construct margin distributions at each layer and summarize them with quartiles and fences (5 statistics per layer).
Concatenate layer-wise signatures into a total signature vector $\theta$ (typically using four layers: input and three hidden).
Predict the generalization gap with a linear model $\hat{g} = a^\top \phi(\theta) + b$, comparing $\phi(x) = x$ and $\phi(x) = \log(x)$.
Evaluate predictive power using $R^2$ on held-out model pools via k-fold cross-validation (k=10), and report adjusted $R^2$ as a measure of model fit.
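
The margin, normalization, and signature steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a toy linear classifier (so the gradient of f_i - f_j with respect to the input is exactly w_i - w_j and the first-order approximation is exact), it takes j to be the highest-scoring class other than the true class i, it skips misclassified points, and it assumes the usual box-plot definition of the fences. The function names `normalized_margins` and `margin_signature` are hypothetical.

```python
import math
import statistics


def normalized_margins(X, y, W, b):
    """Normalized margins at the input of a toy linear classifier f(x) = W x + b.

    X: list of n input vectors, y: list of n integer labels,
    W: one weight row per class, b: one bias per class.
    """
    n, dim = len(X), len(X[0])
    # Total variation: trace of the empirical covariance of the activations,
    # i.e. the sum of the per-dimension variances.
    means = [sum(x[d] for x in X) / n for d in range(dim)]
    total_var = sum((x[d] - means[d]) ** 2 for x in X for d in range(dim)) / n

    margins = []
    for x, i in zip(X, y):
        logits = [sum(w_d * x_d for w_d, x_d in zip(w, x)) + b_c
                  for w, b_c in zip(W, b)]
        # Assumption: compare the true class i with the best competing class j.
        j = max((c for c in range(len(W)) if c != i), key=lambda c: logits[c])
        if logits[i] <= logits[j]:
            continue  # assumption: misclassified points are ignored
        # First-order distance: (f_i - f_j) / ||grad f_i - grad f_j||_2.
        # For a linear model the gradient difference is simply w_i - w_j.
        grad_norm = math.sqrt(sum((wi - wj) ** 2 for wi, wj in zip(W[i], W[j])))
        margins.append((logits[i] - logits[j]) / (grad_norm * math.sqrt(total_var)))
    return margins


def margin_signature(margins):
    """Five statistics per layer: lower fence, Q1, median, Q3, upper fence."""
    q1, med, q3 = statistics.quantiles(margins, n=4, method="inclusive")
    iqr = q3 - q1
    # Box-plot fences: the most extreme data points within 1.5 * IQR of the quartiles.
    lower = min(m for m in margins if m >= q1 - 1.5 * iqr)
    upper = max(m for m in margins if m <= q3 + 1.5 * iqr)
    return [lower, q1, med, q3, upper]
```

For a real deep network the gradient would come from automatic differentiation at each chosen layer, and the per-layer signatures would be concatenated into the total signature vector. Dividing by the square root of the total variation is what makes the margins insensitive to a uniform rescaling of the activations.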

Experimental Results

Research Questions

RQ1: Can margin distributions at hidden layers predict the generalization gap better than output-layer margins or norm-based bounds?
RQ2: Does normalizing margins and aggregating layer-wise margin information improve generalization gap prediction?
RQ3: How many and which layers are needed to achieve accurate predictions across architectures?
RQ4: Can a simple linear model on transformed margin signatures robustly predict generalization gap across datasets and architectures?

Key Findings

Normalized, multi-layer margin distributions correlate strongly with generalization gap, improving prediction over output-margin baselines.
Quartile-based signatures combined with a log transformation yield high predictive power, as measured by adjusted $R^2$ in the paper's experiments.
Margin information from hidden layers is crucial for predictive accuracy, not just margins at the input or output layers.
The proposed margin-based predictor outperforms Bartlett et al. (2017) and other baselines in predicting generalization gap on CIFAR-10/100 with CNNs and ResNets.
The approach applies to feedforward networks including ResNets and suggests potential for new loss functions to improve generalization.
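
The comparison between the identity and log transforms can be illustrated with a small regression sketch. It is a hedged, standard-library-only example rather than the paper's code: the helper names `fit_linear`, `adjusted_r2`, and `compare_transforms` are hypothetical, the fit is plain ordinary least squares solved through the normal equations, and the k-fold evaluation is omitted for brevity. Note that a linear model on log features, a^T log(theta) + b, equals log(prod_i theta_i^{a_i}) + b, which is why working in log space amounts to using a product of margins rather than a sum.

```python
import math


def fit_linear(features, targets):
    """Ordinary least squares g ~ a.x + b via the normal equations."""
    rows = [list(x) + [1.0] for x in features]  # append a bias column
    p = len(rows[0])
    # Augmented system [A^T A | A^T g].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * g for r, g in zip(rows, targets))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    coef = [M[i][p] / M[i][i] for i in range(p)]
    return coef[:-1], coef[-1]  # weights a, intercept b


def adjusted_r2(features, targets, a, b):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    n, k = len(targets), len(a)
    preds = [sum(ai * xi for ai, xi in zip(a, x)) + b for x in features]
    mean_g = sum(targets) / n
    ss_res = sum((g - p) ** 2 for g, p in zip(targets, preds))
    ss_tot = sum((g - mean_g) ** 2 for g in targets)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def compare_transforms(signatures, gaps):
    """Fit the linear predictor with phi(x) = x and phi(x) = log(x)."""
    scores = {}
    for name, phi in (("identity", lambda v: v), ("log", math.log)):
        feats = [[phi(v) for v in sig] for sig in signatures]
        a, b = fit_linear(feats, gaps)
        scores[name] = adjusted_r2(feats, gaps, a, b)
    return scores
```

Each row of `signatures` would hold the concatenated margin statistics of one trained model, and `gaps` the corresponding measured generalization gaps. The log transform requires strictly positive signature entries.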

This review was generated by AI and reviewed by a human editor.