QUICK REVIEW

[Paper Review] On Lazy Training in Differentiable Programming

Lénaïc Chizat, Edouard Oyallon | arXiv (Cornell University) | Dec 19, 2018

Stochastic Gradient Optimization Techniques | Cited by 287

One-Sentence Summary

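The paper argues that lazy training (a model behaving like its linearization around initialization) is not caused by over-parameterization alone but by a choice of scaling, often implicit. It analyzes when the phenomenon occurs and shows empirically that training CNNs in the lazy regime can degrade their performance relative to the non-lazy regime.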

ABSTRACT

In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.

Motivation and Objectives

Motivate and define the lazy training phenomenon in differentiable programming.
Develop a general criterion for when lazy training occurs via scaling and initialization.
Analyze gradient flow dynamics under scaled models and compare to their linearizations.
Provide theoretical bounds and convergence results in lazy regimes across over- and under-parameterized settings.
Evaluate the practical implications of lazy training through synthetic and CNN experiments.

Proposed Method

Introduce a scaling factor alpha and study the objective F_alpha(w) = (1/alpha^2) R(alpha h(w)).
Define the linearized model bar{h}(w) around initialization and compare F_alpha with its linearization bar{F}_alpha.
Derive a general lazy training criterion kappa_h(w0) = ||h(w0)-y*|| * ||D^2 h(w0)|| / ||Dh(w0)||^2 and relate it to the lazy dynamics.
Prove finite-horizon lazy training bounds showing w_alpha(t) approaches w0 and remains close to the linearized path as alpha grows (Theorem 2.2).
Provide a square-loss quantitative bound (Theorem 2.3) and analyze over-parameterized and under-parameterized regimes (Theorems 2.4 and 2.5).
Extend analysis to homogeneous models and two-layer networks, linking to random feature and mean-field limits.
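
Written out, the quantities in the list above take the following form. This is a sketch in the notation of the list: the linearized model is taken to be the first-order Taylor expansion of h at w0, and the last line simply substitutes alpha h for h in the criterion kappa_h(w0).

```latex
\begin{align*}
F_\alpha(w) &= \frac{1}{\alpha^2}\, R\big(\alpha h(w)\big), \\
\bar h(w) &= h(w_0) + Dh(w_0)\,(w - w_0), \\
\bar F_\alpha(w) &= \frac{1}{\alpha^2}\, R\big(\alpha \bar h(w)\big), \\
\kappa_{\alpha h}(w_0)
  &= \frac{\|\alpha h(w_0) - y^\star\| \, \|\alpha D^2 h(w_0)\|}{\|\alpha Dh(w_0)\|^2}
   = \frac{1}{\alpha}\,
     \frac{\|\alpha h(w_0) - y^\star\| \, \|D^2 h(w_0)\|}{\|Dh(w_0)\|^2}.
\end{align*}
```

When h(w0) = 0, the first norm in the numerator reduces to ||y*||, so the criterion decays like 1/alpha. This is the sense in which a large scale, rather than over-parameterization as such, pushes training toward the lazy regime.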

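Experimental Results

Research Questions

RQ1: Under what conditions does gradient-based optimization of a scaled model behave like training the model linearized around its initialization?
RQ2: How do initialization, scaling, and network architecture affect the onset of lazy training?
RQ3: What are the convergence properties and generalization implications of lazy training in the over- and under-parameterized settings?
RQ4: Do practical neural networks (e.g., CNNs) perform worse in the lazy regime than in the non-lazy regime?
RQ5: Can the distance between the lazy-training dynamics and the linearized dynamics be bounded over time?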

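Key Findings

Lazy training is not specific to over-parameterized networks: through implicit scaling, it can arise in essentially any parametric model whose output is close to zero at initialization.
For large alpha, the training dynamics of F_alpha stay close to those of the linearized model bar{F}_alpha, so learning is effectively linear.
For the square loss, under mild smoothness assumptions, an explicit bound holds on the distance between the nonlinear and linearized outputs, and this bound shrinks as alpha grows.
Under suitable conditions on the Jacobian Dh(w0) and the loss, over-parameterized lazy training converges to a global minimizer (Theorem 2.4).
For large alpha, under-parameterized lazy training converges to a local minimizer, so in finite-dimensional settings it may miss the global optimum.
Numerical experiments show that CNNs trained in the lazy regime can perform worse than with non-lazy training and can be ill-conditioned, which undermines the view that lazy training explains the success of neural networks.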
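
The first three findings can be illustrated on a toy problem. The NumPy sketch below is not the paper's experimental setup: the tiny tanh network, the data, the step size, and the helper names (`g`, `jacobian`, `train`) are all hypothetical choices made for illustration, and the model is centered as h(w) = g(w) - g(w0) so that its output is exactly zero at initialization. It runs plain gradient descent on F_alpha and reports how far the parameters move and how far the scaled output is from its linearization at the final parameters (a simpler diagnostic than the path distance bounded in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 20                       # hypothetical toy sizes: samples, hidden units
x = np.linspace(-1.0, 1.0, n)       # 1-D inputs
y_star = np.sin(3.0 * x)            # hypothetical regression targets
w0 = rng.standard_normal(2 * m)     # initial parameters, laid out as [a, b]


def g(w):
    """Uncentered two-layer model: sum_j a_j * tanh(b_j * x)."""
    a, b = w[:m], w[m:]
    return np.tanh(np.outer(x, b)) @ a


def jacobian(w):
    """Jacobian of g with respect to w, shape (n, 2m)."""
    a, b = w[:m], w[m:]
    t = np.tanh(np.outer(x, b))                      # shape (n, m)
    d_a = t                                          # dg/da_j = tanh(b_j x)
    d_b = (1.0 - t**2) * x[:, None] * a[None, :]     # dg/db_j
    return np.hstack([d_a, d_b])


def train(alpha, steps=500, lr=0.01):
    """Gradient descent on F_alpha(w) = R(alpha * h(w)) / alpha**2.

    Assumptions: h(w) = g(w) - g(w0), and R is half the mean squared error.
    Returns (relative parameter movement, linearization gap, final loss).
    """
    g0, j0 = g(w0), jacobian(w0)
    w = w0.copy()
    for _ in range(steps):
        residual = alpha * (g(w) - g0) - y_star
        # grad F_alpha(w) = (1/alpha) * Dh(w)^T grad R(alpha h(w))
        w -= lr * jacobian(w).T @ residual / (n * alpha)
    scaled_out = alpha * (g(w) - g0)
    moved = np.linalg.norm(w - w0) / np.linalg.norm(w0)
    # Scaled output versus its first-order expansion at w0, at the same w.
    gap = np.linalg.norm(scaled_out - alpha * j0 @ (w - w0))
    loss = 0.5 * np.mean((scaled_out - y_star) ** 2)
    return moved, gap, loss


if __name__ == "__main__":
    for alpha in (1.0, 10.0, 100.0):
        moved, gap, loss = train(alpha)
        print(f"alpha={alpha:6.1f}  rel. move={moved:.2e}  "
              f"lin. gap={gap:.2e}  loss={loss:.2e}")
```

In this sketch the step size is kept fixed across values of alpha because the 1/alpha^2 normalization makes the output dynamics comparable between scales. Larger alpha should show smaller parameter movement and a smaller linearization gap.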

This review was generated by AI and reviewed by a human editor.