QUICK REVIEW

[Paper Review] A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu, Yuanzhi Li | arXiv (Cornell University) | 2018-11-09

Reinforcement Learning in Robotics | References: 55 | Citations: 627

One-Line Summary

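This paper proves that over-parameterized deep neural networks trained by SGD or gradient descent from random initialization reach zero training error in polynomial time, by showing that the objective is almost convex in a large neighborhood of the initialization and that the network is equivalent to its neural tangent kernel (NTK) there.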

ABSTRACT

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

Research Motivation and Objectives

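Explain theoretically why deep networks trained with first-order methods succeed in practice despite non-convex, non-smooth objectives.
Show that over-parameterized deep networks can be trained from random initialization to zero training error in polynomial time.
Extend over-parameterization theory from two-layer to multi-layer networks, covering ReLU activations and a range of architectures.
Establish the connection between over-parameterized networks and the NTK at finite, polynomial width.
Provide a framework that applies to fully-connected, convolutional, and residual architectures under mild data assumptions.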

Proposed Method

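Analyzes the training dynamics of an L-layer fully-connected ReLU network under L2 regression loss (the analysis extends to other losses).
Proves that the objective is almost convex and semi-smooth in a region around the random initialization, which lets SGD/GD converge in polynomial time.
Shows the equivalence between over-parameterized networks and the NTK at finite width m = poly(n, L, δ^{-1}).
Derives the gradient formula and back-propagation structure using diagonal sign matrices D_{i,ℓ} to handle the non-smoothness of ReLU.
Shows that forward and backward propagation stay controlled across all L layers (no exponential gradient explosion or vanishing).
Provides a stability analysis under small perturbations and discusses implications for generalization through NTK behavior.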
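
The sign-matrix idea in the list above can be illustrated with a minimal NumPy sketch. It is not the paper's code: the function names (`init_network`, `forward`, `hidden_grads`), the Gaussian initialization variances, and the toy sizes are all assumptions made for illustration. The sketch stores each diagonal sign matrix as a 0/1 vector, uses it to write ReLU(g) as D * g, and back-propagates a squared-loss gradient to the hidden weight matrices only.

```python
import numpy as np

def init_network(d_in, m, L, d_out, rng):
    """Gaussian initialization; the variances are assumptions of this sketch."""
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d_in))
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]
    B = rng.normal(0.0, np.sqrt(1.0 / d_out), size=(d_out, m))
    return A, Ws, B

def forward(A, Ws, B, x):
    """Return the output, hidden vectors h_0..h_L, and sign diagonals D_0..D_L."""
    g = A @ x
    D = (g >= 0).astype(float)      # diagonal of the sign matrix, kept as a vector
    hs, Ds = [D * g], [D]           # ReLU(g) equals D * g
    for W in Ws:
        g = W @ hs[-1]
        D = (g >= 0).astype(float)
        hs.append(D * g)
        Ds.append(D)
    return B @ hs[-1], hs, Ds

def loss(A, Ws, B, x, y):
    """Squared loss 0.5 * ||output - y||^2 for one sample."""
    out, _, _ = forward(A, Ws, B, x)
    return 0.5 * np.sum((out - y) ** 2)

def hidden_grads(A, Ws, B, x, y):
    """Gradient of the squared loss w.r.t. each hidden matrix, via sign vectors."""
    out, hs, Ds = forward(A, Ws, B, x)
    v = B.T @ (out - y)                  # backward signal arriving at h_L
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        u = Ds[l + 1] * v                # mask by the signs of the layer Ws[l] produces
        grads[l] = np.outer(u, hs[l])    # outer product with that layer's input
        v = Ws[l].T @ u                  # push the signal one layer back
    return grads, hs

# Toy check of "no exponential blow-up": hidden norms across layers for a unit input.
rng = np.random.default_rng(0)
x = rng.normal(size=16)
x /= np.linalg.norm(x)
A, Ws, B = init_network(d_in=16, m=1024, L=8, d_out=1, rng=rng)
grads, hs = hidden_grads(A, Ws, B, x, np.array([1.0]))
layer_norms = [float(np.linalg.norm(h)) for h in hs]
print("hidden-layer norms:", np.round(layer_norms, 3))
```

With this variance choice each layer preserves the squared norm of its input in expectation, which is why the printed norms are expected to stay of order one rather than grow or shrink geometrically with depth; the paper's own statement of this property comes with high-probability bounds that the sketch does not reproduce.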

Experimental Results

Research Questions

RQ1: Can deep neural networks trained by SGD from random initialization achieve zero training error under mild over-parameterization and non-degenerate data?
RQ2: How large must the hidden width be (as a polynomial in n, L, and data separation δ) to guarantee polynomial-time convergence?
RQ3: Does the training landscape exhibit near-convexity and semi-smoothness in a neighborhood of random initialization for multi-layer networks?
RQ4: Is there a finite-width equivalence between over-parameterized networks and the neural tangent kernel (NTK) similar to infinite-width results?
RQ5: Do these results extend to CNNs and ResNets with ReLU activations and to various loss functions beyond squared loss?
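
To see why the landscape properties in RQ3 would matter for RQ1 and RQ2, the following schematic derivation may help. It is a simplified sketch, not the paper's statement: $\alpha$ and $\beta$ are placeholder constants standing in for the paper's explicit polynomial factors, and the paper's actual semi-smoothness bound contains further terms that are omitted here.

```latex
% Schematic only: \alpha and \beta are placeholders, not the paper's constants.
\begin{align*}
\|\nabla F(W)\|_F^2 &\ge \alpha\, F(W)
  && \text{(gradient lower bound near initialization)} \\
F\bigl(W - \eta \nabla F(W)\bigr) &\le F(W) - \eta \|\nabla F(W)\|_F^2
  + \tfrac{\beta \eta^2}{2} \|\nabla F(W)\|_F^2
  && \text{(smoothness-type upper bound)} \\
\Longrightarrow\quad F(W_{t+1}) &\le \Bigl(1 - \tfrac{\eta \alpha}{2}\Bigr) F(W_t)
  && \text{for } \eta \le 1/\beta \\
\Longrightarrow\quad F(W_T) &\le \varepsilon
  && \text{once } T \ge \tfrac{2}{\eta \alpha} \log \tfrac{F(W_0)}{\varepsilon}
\end{align*}
```

Under these two assumed inequalities the loss contracts by a constant factor per step, which is the linear convergence rate mentioned in the abstract; the argument only goes through if the iterates stay inside the neighborhood where both inequalities hold.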

Key Results

Gradient descent finds an ε-error global minimum in poly(n, L, δ^{-1}) · log(1/ε) iterations for regression tasks, given width m ≥ poly(n, L, δ^{-1}) · d.
SGD reaches the same ε training error in poly(n, L, δ^{-1}) · log^2 m · log(1/ε) iterations with an appropriate learning rate and mini-batch size.
The objective near random initialization is almost convex and semi-smooth, precluding problematic saddles and enabling guaranteed descent.
There is a polynomial-width equivalence between over-parameterized networks and the NTK in the finite-width setting (not only at infinite width).
The analysis handles non-smooth ReLU activations and extends to CNNs and ResNets.
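
The convergence claims above can be explored empirically with a small NumPy sketch. Everything here is an illustrative assumption rather than the paper's setup: the helper names (`make_data`, `train_hidden_layers`), the width m = 256, the depth, the step size, and the choice to train only the hidden matrices while keeping the input and output layers fixed. The toy width is far below the polynomial bound in the theorems, so this is a sanity check of the qualitative behavior, not a verification of the theory.

```python
import numpy as np

def make_data(n=10, d_in=10, d_out=1, seed=0):
    """Random unit-norm inputs with random real-valued targets (toy data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d_in))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    Y = rng.normal(size=(n, d_out))
    return X, Y

def train_hidden_layers(X, Y, m=256, L=2, lr=1e-4, steps=500, seed=1):
    """Full-batch gradient descent on the hidden matrices of a wide ReLU net.

    Returns the squared training loss before each update and after the last
    counted one (steps + 1 values). All hyperparameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = X.shape[1], Y.shape[1]
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d_in))     # fixed input layer
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]
    B = rng.normal(0.0, np.sqrt(1.0 / d_out), size=(d_out, m))  # fixed output layer
    losses = []
    for _ in range(steps + 1):
        # Forward pass for the whole batch; H[l] has shape (n, m).
        H = [np.maximum(X @ A.T, 0.0)]
        for W in Ws:
            H.append(np.maximum(H[-1] @ W.T, 0.0))
        R = H[-1] @ B.T - Y                    # residuals, shape (n, d_out)
        losses.append(0.5 * np.sum(R ** 2))
        # Backward pass: V is the gradient with respect to the current layer output.
        V = R @ B
        grads = [None] * L
        for l in reversed(range(L)):
            G = V * (H[l + 1] > 0)             # mask by the ReLU activation pattern
            grads[l] = G.T @ H[l]              # gradient for Ws[l], shape (m, m)
            V = G @ Ws[l]
        for l in range(L):
            Ws[l] -= lr * grads[l]
    return np.array(losses)

X, Y = make_data()
losses = train_hidden_layers(X, Y)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.6f}")
```

Plotting `np.log(losses)` against the step index is one way to look for the roughly straight-line decay that a linear convergence rate would predict; varying `m` and the number of samples gives a feel for how the required width depends on n.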

This review was generated by AI and reviewed by a human editor.