QUICK REVIEW

[Paper Review] Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

Lénaïc Chizat, Francis Bach|arXiv (Cornell University)|Feb 11, 2020

Stochastic Gradient Optimization Techniques|References: 49|Cited by: 90

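One-sentence summary

This paper characterizes the implicit bias of gradient flow for infinitely wide two-layer networks trained with exponentially tailed losses. It shows that the flow converges to a max-margin classifier in a non-Hilbertian function space, contrasts training both layers with training only the output layer, and supports the analysis with numerical experiments.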

ABSTRACT

Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.

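Motivation and Objectives

Understand why over-parameterized neural networks trained with gradient methods generalize well.
Characterize the limiting behavior of gradient flow for infinitely wide two-layer networks with homogeneous activations, whose units are positively 2-homogeneous in their parameters.
Show that the learned classifier is a max-margin solution in the variation-norm space.
Contrast training both layers with training only the output layer, and analyze the effect on generalization.
Provide numerical evidence that the theory describes two-layer networks with ReLU activation.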

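Proposed Method

Model the predictor as a finite-width two-layer network with positively 2-homogeneous units and a symmetric/balanced structure.
Describe the predictor through a measure-based convex reformulation, using the variation norm $\|\cdot\|_{\mathcal{F}_1}$ and the associated $\mathcal{F}_1$-max-margin objective.
Formulate the infinite-width limit as a Wasserstein gradient flow over probability measures on the parameters.
Prove, under suitable assumptions, that the limit of the training dynamics is a maximizer of the $\mathcal{F}_1$-max-margin problem.
Contrast this with the RKHS framework $\mathcal{F}_2$, and discuss the computational aspects of training only the output layer.
Discuss convergence rates for simplified dynamics and their connection to online mirror ascent.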
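
The objects named in the steps above can be written out as follows. This is a sketch in notation chosen here for illustration (the symbols $h_m$, $h_\mu$, $\phi$, $w_j = (a_j, b_j)$ and the ReLU form of $\phi$ are assumptions, not copied from the paper), meant only to make the two max-margin problems concrete.

```latex
% Finite-width two-layer network with m units (mean-field 1/m scaling):
h_m(x) = \frac{1}{m} \sum_{j=1}^{m} \phi(w_j, x),
\qquad \phi(w, x) = a \, \max(b^\top x, 0), \quad w = (a, b).

% phi is positively 2-homogeneous in w:
\phi(\lambda w, x) = \lambda^2 \, \phi(w, x) \quad \text{for all } \lambda > 0.

% Infinite-width limit: the units are replaced by a probability measure mu
% over parameters, which evolves by a Wasserstein gradient flow:
h_\mu(x) = \int \phi(w, x) \, \mathrm{d}\mu(w).

% Max-margin problem in the variation-norm space (training both layers):
\max_{\|f\|_{\mathcal{F}_1} \le 1} \; \min_{1 \le i \le n} \; y_i f(x_i).

% Kernel SVM counterpart in the RKHS (training only the output layer):
\max_{\|f\|_{\mathcal{F}_2} \le 1} \; \min_{1 \le i \le n} \; y_i f(x_i).
```

The two problems differ only in the norm that constrains $f$, which is where the difference in adaptivity to low-dimensional structure comes from.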

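Experimental Results

Research Questions

RQ1: Does gradient flow with exponentially tailed losses converge to the global max-margin solution in the variation-norm function space $\mathcal{F}_1$?
RQ2: In terms of implicit bias, how do the training dynamics differ between jointly training both layers and training only the output layer?
RQ3: Can dimension-independent generalization be established in the presence of hidden low-dimensional structure?
RQ4: Do numerical experiments on wide two-layer ReLU networks agree with the theoretical max-margin characterization?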

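Key Findings

For infinitely wide two-layer networks with exponentially tailed losses, gradient flow converges to the max-margin classifier in the variation-norm space $\mathcal{F}_1$.
In the presence of hidden low-dimensional structure, the resulting margin is independent of the ambient dimension, which yields strong generalization guarantees.
Training only the output layer implicitly solves a kernel SVM in the RKHS $\mathcal{F}_2$, which a priori does not enjoy the same adaptivity as the $\mathcal{F}_1$ margin.
In simplified settings, the training dynamics are equivalent to online mirror ascent, with a convergence rate of $O(\log t / \sqrt{t})$.
Numerical experiments indicate that the theory describes the practical behavior of two-layer ReLU networks and support the statistical benefits of this implicit bias.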
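
The comparison between training both layers and training only the output layer can be tried on a toy problem. The following is a minimal NumPy sketch, not the authors' experimental setup: the data generator, the width `m=200`, the step size, the absence of a bias term, and all function names are assumptions made for illustration. It uses plain gradient descent (a discretization of gradient flow) on the logistic loss, with the step scaled by `m` to match the `1/m` output scaling. The reported margin is normalized by `mean(|a_j| * ||b_j||)`, which upper-bounds the variation norm of the finite network and therefore gives only a lower bound on its $\mathcal{F}_1$ margin.

```python
import numpy as np

def make_data(n=40, d=5, gap=0.5, seed=0):
    """Toy data (assumed): labels depend on the first coordinate only."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = np.where(X[:, 0] >= 0, 1.0, -1.0)
    X[:, 0] += gap * y  # push points away from the decision boundary
    return X, y

def init_params(m, d, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.choice([-1.0, 1.0], size=m)  # output weights
    B = rng.standard_normal((m, d))      # hidden weights, one row per unit
    return a, B

def predict(a, B, X):
    # h(x) = (1/m) * sum_j a_j * relu(b_j . x)
    return np.maximum(X @ B.T, 0.0) @ a / a.size

def logistic_loss(a, B, X, y):
    return np.mean(np.logaddexp(0.0, -y * predict(a, B, X)))

def train(a, B, X, y, steps=500, lr=0.1, train_hidden=True):
    """Gradient descent on the mean logistic loss; returns updated copies."""
    a, B = a.copy(), B.copy()
    n, m = X.shape[0], a.size
    for _ in range(steps):
        pre = X @ B.T                    # (n, m) pre-activations
        act = np.maximum(pre, 0.0)
        out = act @ a / m
        # d(loss)/d(out_i) = -y_i * sigmoid(-y_i * out_i) / n
        g = -y * np.exp(-np.logaddexp(0.0, y * out)) / n
        grad_a = act.T @ g / m
        if train_hidden:
            grad_B = ((g[:, None] * (pre > 0.0)) * a[None, :]).T @ X / m
            B -= lr * m * grad_B         # step scaled by m (mean-field scaling)
        a -= lr * m * grad_a
    return a, B

def normalized_margin(a, B, X, y):
    # Proxy only: mean |a_j| * ||b_j|| upper-bounds the variation norm.
    mass = np.mean(np.abs(a) * np.linalg.norm(B, axis=1))
    return np.min(y * predict(a, B, X)) / mass

X, y = make_data()
a0, B0 = init_params(m=200, d=X.shape[1])
for train_hidden in (True, False):
    a, B = train(a0, B0, X, y, train_hidden=train_hidden)
    label = "both layers" if train_hidden else "output layer only"
    print(label, "loss:", logistic_loss(a, B, X, y),
          "margin proxy:", normalized_margin(a, B, X, y))
```

For the output-only run, a normalization based on the squared output weights would be the more natural counterpart of the $\mathcal{F}_2$ norm; the same proxy is used for both runs here only to keep the sketch short.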

This review was generated by AI and checked by a human editor.