QUICK REVIEW

[Paper Review] Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Greg Yang|arXiv (Cornell University)|Feb 13, 2019

Gaussian Processes and Bayesian Inference | References: 78 | Cited by: 187

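One-Sentence Summary

This paper introduces a unified tensor program framework for deriving the scaling limits of wide neural networks, and uses it to establish Gaussian process behavior, conditions for the gradient independence assumption, and convergence of the Neural Tangent Kernel for standard architectures without batch normalization.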

ABSTRACT

Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline tensor program that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows (1) the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; (2) conditions under which the gradient independence assumption -- that weights in backpropagation can be assumed to be independent from weights in the forward pass -- leads to correct computation of gradient dynamics, and corrections when it does not; (3) the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results in neural network Jacobian singular values. We hope our work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.

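Motivation and Goals

Define a unified tensor program framework that expresses most neural network computations, including those with weight sharing.
Characterize the scaling limits of these programs as width tends to infinity under Glorot-style initialization.
Derive Gaussian process behavior for a broad range of architectures (recurrent neural networks (RNNs), convolutional neural networks (CNNs), residual networks (ResNets), attention, and others).
Analyze when the gradient independence assumption yields correct gradient dynamics, and give corrections when it does not.
Prove convergence of the Neural Tangent Kernel at initialization for architectures without batch normalization (BN).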

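Proposed Method

Introduce tensor programs with G, A, and H variables (Gaussian-like vectors, weight matrices, and images of G variables under nonlinearities, respectively) to encode neural computations.
Define common dimension classes (CDCs) and the sampling scheme for weights and inputs.
Prove that, in the wide limit, G variables converge to Gaussians with computable means and covariances (Theorems 4.3, 5.1, 6.3).
Derive the DNN-GP correspondence for standard architectures under general nonlinearities (Corollary 2.1).
Derive, informally, when gradient independence is valid (Corollary 2.3), with corrections where needed.
For finite input sets and without batch normalization, establish Neural Tangent Kernel convergence Kθ → K∞ (Corollary 2.4).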
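
The Gaussian process limit in the steps above can be checked numerically in the simplest case. The following is a minimal sketch, not the paper's tensor program machinery. It assumes a one-hidden-layer ReLU network f(x) = (1/sqrt(n)) * sum_i a_i * relu(w_i . x) with i.i.d. standard Gaussian a_i and w_i. Conditioned on the hidden weights, the output covariance of such a network is the empirical kernel computed below, and its infinite-width limit is the degree-1 arc-cosine kernel. The function names, inputs, and widths are illustrative choices, not taken from the paper.

```python
import math
import random


def empirical_kernel(x1, x2, width, rng):
    """Average relu(w.x1) * relu(w.x2) over `width` hidden units, w ~ N(0, I)."""
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x1]
        h1 = sum(wi * v for wi, v in zip(w, x1))
        h2 = sum(wi * v for wi, v in zip(w, x2))
        total += max(h1, 0.0) * max(h2, 0.0)
    return total / width


def limit_kernel(x1, x2):
    """Closed-form E[relu(u) relu(v)] for (u, v) = (w.x1, w.x2), w ~ N(0, I)."""
    n1 = math.sqrt(sum(v * v for v in x1))
    n2 = math.sqrt(sum(v * v for v in x2))
    cos_t = sum(a * b for a, b in zip(x1, x2)) / (n1 * n2)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding outside [-1, 1]
    theta = math.acos(cos_t)
    return n1 * n2 * (math.sin(theta) + (math.pi - theta) * cos_t) / (2.0 * math.pi)


# Illustrative inputs (assumed for this sketch only).
x1 = [1.0, 0.5, -0.3]
x2 = [0.2, -1.0, 0.8]
rng = random.Random(0)
for width in (100, 10_000):
    print(width, empirical_kernel(x1, x2, width, rng), limit_kernel(x1, x2))
```

As the width grows, the empirical value should settle near the closed-form value, with fluctuations shrinking roughly like 1/sqrt(width).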

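Experimental Results

Research Questions

RQ1: Under what conditions do wide neural networks with weight sharing converge to Gaussian processes across common architectures?
RQ2: When is the gradient independence assumption valid in backpropagation, and if it fails, how can the correct gradient dynamics be computed?
RQ3: For standard architectures without batch normalization, how does the Neural Tangent Kernel behave at initialization, and when does it converge to the limiting kernel K∞?
RQ4: Can the framework reproduce classical random matrix results (for example, the semicircle law and the Marchenko-Pastur law) as special cases?
RQ5: What role does weight sharing (including transposed weights) play in the scaling limits of the various architectures (RNN, CNN, residual, attention)?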
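
For readers unfamiliar with the semicircle law mentioned in RQ4, here is a small numerical illustration of what it asserts. This is a direct Monte Carlo check under stated assumptions, not the tensor program derivation from the paper. It assumes a symmetric Gaussian (Wigner) matrix W with off-diagonal variance 1/n, whose eigenvalue distribution approaches the semicircle density on [-2, 2]. The second and fourth moments of that density are the Catalan numbers 1 and 2, so tr(W^2)/n and tr(W^4)/n should be close to 1 and 2 for large n. The helper names are hypothetical.

```python
import math
import random


def wigner(n, rng):
    """Symmetric n x n Gaussian matrix: off-diagonal N(0, 1/n), diagonal N(0, 2/n)."""
    w = [[0.0] * n for _ in range(n)]
    s = 1.0 / math.sqrt(n)
    for i in range(n):
        w[i][i] = rng.gauss(0.0, math.sqrt(2.0) * s)
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.gauss(0.0, s)
    return w


def normalized_trace_moments(w):
    """Return (tr(W^2)/n, tr(W^4)/n); for symmetric W, tr(W^4) = ||W^2||_F^2."""
    n = len(w)
    w2 = [[sum(w[i][k] * w[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    m2 = sum(w2[i][i] for i in range(n)) / n
    m4 = sum(v * v for row in w2 for v in row) / n
    return m2, m4


m2, m4 = normalized_trace_moments(wigner(120, random.Random(0)))
print("tr(W^2)/n:", m2, "(semicircle second moment: 1)")
print("tr(W^4)/n:", m4, "(semicircle fourth moment: 2)")
```

At finite n the values differ from 1 and 2 by corrections of order 1/n, so only approximate agreement should be expected.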

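Key Findings

The DNN-GP correspondence extends to standard architectures and nonlinearities, yielding a Gaussian process limit as width grows (Corollary 2.1).
Under certain conditions, the gradient independence assumption gives the correct backpropagation dynamics for polynomially bounded nonlinearities, and explicit corrections are given where it fails (Corollary 2.3).
For standard architectures without batch normalization and finite input sets, the NTK at initialization converges to the limit K∞ (Corollary 2.4).
The tensor program framework can rederive classical random matrix results and connects to state-evolution-style analyses of related algorithms such as approximate message passing (AMP).
The work provides a general method for analyzing signal propagation and gradient dynamics, which enables the design of initialization schemes that avoid gradient explosion and vanishing.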
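
The NTK convergence listed among the findings can be illustrated on a toy model. The sketch below is not the paper's general result. It assumes a one-hidden-layer ReLU network in the NTK parametrization, f(x) = (1/sqrt(n)) * sum_i a_i * relu(w_i . x), with i.i.d. standard Gaussian a_i and w_i. For this model the empirical NTK at initialization is the inner product of parameter gradients, and its infinite-width limit has a closed form built from two Gaussian expectations. All function names and inputs are illustrative assumptions.

```python
import math
import random


def empirical_ntk(x1, x2, width, rng):
    """Inner product of parameter gradients of f at x1 and x2 for a random init."""
    dot = sum(a * b for a, b in zip(x1, x2))
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x1]
        a = rng.gauss(0.0, 1.0)
        h1 = sum(wi * v for wi, v in zip(w, x1))
        h2 = sum(wi * v for wi, v in zip(w, x2))
        # Output-weight gradients contribute relu(h1) * relu(h2).
        total += max(h1, 0.0) * max(h2, 0.0)
        # Hidden-weight gradients contribute a^2 * relu'(h1) * relu'(h2) * <x1, x2>.
        if h1 > 0.0 and h2 > 0.0:
            total += a * a * dot
    return total / width  # the 1/n comes from the 1/sqrt(n) output scaling


def limit_ntk(x1, x2):
    """Infinite-width limit: E[relu(u) relu(v)] + <x1, x2> * P(u > 0, v > 0)."""
    n1 = math.sqrt(sum(v * v for v in x1))
    n2 = math.sqrt(sum(v * v for v in x2))
    dot = sum(a * b for a, b in zip(x1, x2))
    cos_t = max(-1.0, min(1.0, dot / (n1 * n2)))
    theta = math.acos(cos_t)
    k_relu = n1 * n2 * (math.sin(theta) + (math.pi - theta) * cos_t) / (2.0 * math.pi)
    k_step = (math.pi - theta) / (2.0 * math.pi)
    return k_relu + dot * k_step


# Illustrative inputs (assumed for this sketch only).
xa = [1.0, 0.5, -0.3]
xb = [0.8, 0.9, 0.1]
rng = random.Random(0)
for width in (100, 10_000):
    print(width, empirical_ntk(xa, xb, width, rng), limit_ntk(xa, xb))
```

The empirical kernel depends on the random initialization at small width and concentrates around the deterministic limit as width grows, which is the behavior the convergence Kθ → K∞ describes for this toy case.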


This review was generated by AI and reviewed by a human editor.