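[Paper Explained] MaxUp: A Simple Way to Improve Generalization of Neural Network Training
MaxUp minimizes the maximum loss over augmented data, which induces gradient-norm regularization and improves generalization on vision, language, and certification tasks with minimal overhead.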
We propose MaxUp, an embarrassingly simple, highly effective technique for improving the generalization performance of machine learning models, especially deep neural networks. The idea is to generate a set of augmented data with some random perturbations or transforms and minimize the maximum, or worst-case, loss over the augmented data. By doing so, we implicitly introduce a smoothness or robustness regularization against the random perturbations, and hence improve the generalization performance. For example, in the case of Gaussian perturbation, MaxUp is asymptotically equivalent to using the gradient norm of the loss as a penalty to encourage smoothness. We test MaxUp on a range of tasks, including image classification, language modeling, and adversarial certification, on which MaxUp consistently outperforms the existing best baseline methods, without introducing substantial computational overhead. In particular, we improve ImageNet classification from the state-of-the-art top-1 accuracy of 85.5% without extra data to 85.8%. Code will be released soon.
Research Motivation and Goals
- Motivate the problem of overfitting and the generalization gap in neural network training.
- Propose MaxUp to enforce robustness against random data perturbations.
- Show that MaxUp acts as a gradient-norm regularization under Gaussian perturbations.
- Demonstrate improvements across image classification, language modeling, and adversarial certification.
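Proposed Method
- For each data point x, generate m augmented copies x_1', ..., x_m' drawn from a perturbation distribution P(·|x).
- Minimize the worst-case loss over these m augmented copies: min_theta E_{x~D}[ max_{i in [m]} L(x_i', theta) ].
- Backpropagate only through the worst augmented copy of each data point, which gives a simple SGD update (the gradient equals the gradient at the worst copy).
- Interpret MaxUp, via a Taylor expansion, as introducing a gradient-norm regularization term ||∇_x L(x, theta)||_2 with coefficient c_{m,σ} = Θ(σ sqrt(log m)).
- Show that under isotropic Gaussian perturbation P(·|x) = N(x, σ^2 I), the expected MaxUp risk is L(x, theta) + c_{m,σ} ||∇_x L(x, theta)||_2 + O(σ^2).
- Explain how MaxUp complements existing data augmentation and how it relates to lightweight adversarial training and online hard example mining.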
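The update described in the list above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a linear logistic-regression model, isotropic Gaussian perturbations, and pure-Python lists instead of a deep network. The names `logistic_loss`, `logistic_grad_w`, `gaussian_copies`, and `maxup_step` are hypothetical and chosen only for this example.

```python
import math
import random


def logistic_loss(w, x, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable form of softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def logistic_grad_w(w, x, y):
    """Gradient of logistic_loss with respect to the weights w."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # s = sigmoid(-margin), computed without overflow for large |margin|.
    if margin >= 0:
        e = math.exp(-margin)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return [-y * s * xi for xi in x]


def gaussian_copies(x, m, sigma, rng):
    """Draw m copies of x from N(x, sigma^2 I) (assumed perturbation P(.|x))."""
    return [[xi + sigma * rng.gauss(0.0, 1.0) for xi in x] for _ in range(m)]


def maxup_step(w, batch, m, sigma, lr, rng):
    """One SGD step on the MaxUp objective for a toy linear model.

    For every (x, y) in the batch, only the augmented copy with the largest
    loss contributes to the gradient. Returns (new_w, mean worst-case loss).
    """
    grad = [0.0] * len(w)
    total = 0.0
    for x, y in batch:
        copies = gaussian_copies(x, m, sigma, rng)
        losses = [logistic_loss(w, c, y) for c in copies]
        worst = max(range(m), key=losses.__getitem__)  # index of worst copy
        total += losses[worst]
        g = logistic_grad_w(w, copies[worst], y)
        grad = [a + b for a, b in zip(grad, g)]
    n = len(batch)
    new_w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return new_w, total / n


if __name__ == "__main__":
    rng = random.Random(0)
    w = [0.0, 0.0]
    data = [([1.0, 2.0], 1), ([-1.0, 0.5], -1), ([2.0, 1.0], 1)]
    for _ in range(50):
        w, worst_case_loss = maxup_step(w, data, m=4, sigma=0.1, lr=0.5, rng=rng)
    print(w, worst_case_loss)
```

With m = 1 and sigma = 0 the step reduces to ordinary SGD on the clean data, which is a convenient sanity check. In a deep-learning framework the same idea would typically be expressed by evaluating all m copies in one forward pass and taking a per-example maximum of the losses before averaging; that variant is not shown here.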
Experimental Results
Research Questions
- RQ1: Does minimizing the maximum loss over augmented data improve generalization beyond standard data augmentation?
- RQ2: How does MaxUp relate to gradient-norm regularization under perturbations such as Gaussian noise?
- RQ3: Can MaxUp improve performance across diverse tasks (vision, language modeling, certified robustness) and architectures without substantial computational overhead?
- RQ4: How does the choice of m and the augmentation distribution P(·|x) affect performance across datasets?
- RQ5: How does MaxUp interact with existing adversarial training regimes and certification methods?
Key Findings
- MaxUp improves generalization across image classification, language modeling, and adversarial certification tasks.
- On ImageNet, MaxUp with CutMix raises top-1 accuracy from 85.5% (the previous state of the art without extra data) to 85.8%.
- On CIFAR-10 with Cutout, MaxUp improves accuracy from 95.41% to 95.52% (averaged over runs) for certain architectures.
- On CIFAR-100 with WideResNet-28-10, MaxUp with Cutout improves accuracy from 75.26% to 82.48% (the latter obtained with m=10).
- In language modeling, MaxUp applied to AWD-LSTM yields lower perplexities on PTB and WT2 than prior state-of-the-art baselines.
- For adversarial certification, MaxUp with Gaussian perturbations (MaxUp+Gauss) outperforms Cohen et al. (2019) and PGD-based training across examined radii, with faster, easier hyperparameter tuning.
- MaxUp provides a lightweight alternative to PGD adversarial training, with minimal overhead and broad compatibility with augmentation schemes.
This explainer was generated by AI and reviewed by a human editor.