QUICK REVIEW

[Paper Review] Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Greg Yang | arXiv (Cornell University) | February 13, 2019

Gaussian Processes and Bayesian Inference | References: 78 | Citations: 187

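One-Line Summary

The paper presents a unified tensor program framework for deriving the scaling limits of wide neural networks and, for standard architectures, establishes Gaussian process behavior (with or without batch normalization), conditions under which the gradient independence assumption holds, and convergence of the Neural Tangent Kernel at initialization (without batch normalization).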

ABSTRACT

Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline tensor program that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows (1) the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; (2) conditions under which the gradient independence assumption -- that weights in backpropagation can be assumed to be independent from weights in the forward pass -- leads to correct computation of gradient dynamics, and corrections when it does not; (3) the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results in neural network Jacobian singular values. We hope our work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.

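Motivation and Objectives

Define a unified tensor program framework that expresses most neural network computations, including those with weight sharing.
Characterize the scaling limits of these programs as width goes to infinity under Glorot-style initialization.
Derive Gaussian process behavior for wide architectures (RNNs, CNNs, ResNets, attention, and others).
Analyze when the gradient independence assumption yields correct gradient dynamics, and provide corrections for when it does not.
Show that the Neural Tangent Kernel at initialization converges to a limit K∞ for architectures without batch normalization.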

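Proposed Method

Introduce tensor programs, which encode neural network computations using G-, A-, and H-vars.
Define common dimension classes (CDCs) and a sampling scheme for weights and inputs.
Show that, in the wide limit, G-vars converge to Gaussians whose means and covariances are computable (Theorems 4.3, 5.1, 6.3).
Derive the DNN-GP correspondence for standard architectures under a broad class of nonlinearities (Corollary 2.1).
Derive (informally) when the gradient independence assumption is valid (Corollary 2.3), with corrections where needed.
Establish Neural Tangent Kernel convergence Kθ → K∞ on finite input sets without batch normalization (Corollary 2.4).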
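
The Gaussian limit described in the list above can be illustrated numerically in the simplest possible setting. The following Python sketch is not taken from the paper: it assumes a single fully connected ReLU layer with i.i.d. standard Gaussian weights, and the names `empirical_kernel` and `relu_gaussian_expectation` are invented for this example. It compares the width-averaged product of post-activations at two inputs against the closed-form Gaussian expectation (the degree-1 arc-cosine kernel). In this one-layer case the preactivations are exactly Gaussian, so the check reduces to a law of large numbers; the cited theorems are what extend statements of this kind to deep programs with weight sharing.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def empirical_kernel(x, y, width, rng):
    """Average relu(w.x) * relu(w.y) over `width` hidden units.

    Assumption (not from the paper): each weight row w has i.i.d. N(0, 1)
    entries, so the preactivations (w.x, w.y) are jointly Gaussian.
    """
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        total += relu(dot(w, x)) * relu(dot(w, y))
    return total / width


def relu_gaussian_expectation(x, y):
    """Closed form of E[relu(w.x) * relu(w.y)] for w ~ N(0, I)."""
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    cos_theta = max(-1.0, min(1.0, dot(x, y) / (nx * ny)))
    theta = math.acos(cos_theta)
    return nx * ny * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    target = relu_gaussian_expectation(x, y)
    for width in (100, 1000, 20000):
        # The gap to the closed form should shrink roughly like width ** -0.5.
        print(width, empirical_kernel(x, y, width, rng), target)
```

The inputs, widths, and seed are arbitrary choices; any pair of nonzero vectors should show the same trend of the empirical average approaching the closed form as width grows.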

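Experimental Results

Research Questions

RQ1: Under what conditions do wide neural networks with weight sharing converge to Gaussian processes across general architectures?
RQ2: When is the gradient independence assumption valid in backpropagation, and how can the correct gradient dynamics be computed when it fails?
RQ3: How does the Neural Tangent Kernel behave at initialization for standard architectures without batch normalization, and when does it converge to K∞?
RQ4: Can the framework recover classical random matrix results (e.g., the semicircle law, Marchenko-Pastur) as special cases?
RQ5: What role does weight sharing (including transposes) play in the scaling limits of different architectures (RNN, CNN, residual, attention)?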
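
As a reminder of what the semicircle law in RQ4 predicts, the sketch below samples a symmetric Gaussian matrix with entry variance 1/n and checks its normalized trace moments. It is an illustration written for this review, not the paper's tensor-program derivation, and the function names `wigner_matrix` and `spectral_moments` as well as the size n = 150 are assumptions made for the example. For the semicircle distribution on [-2, 2], the second and fourth moments are 1 and 2, so tr(A^2)/n and tr(A^4)/n should be close to those values for large n.

```python
import random


def wigner_matrix(n, rng):
    """Symmetric n x n matrix with N(0, 1/n) entries on and above the diagonal."""
    scale = n ** -0.5
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            g = rng.gauss(0.0, 1.0) * scale
            a[i][j] = g
            a[j][i] = g
    return a


def spectral_moments(a):
    """Return (tr(A^2)/n, tr(A^4)/n) for a symmetric matrix A."""
    n = len(a)
    # tr(A^2) equals the sum of squared entries when A is symmetric.
    m2 = sum(v * v for row in a for v in row) / n
    # tr(A^4) equals the squared Frobenius norm of A^2, and
    # (A^2)[i][j] is the dot product of rows i and j of A.
    m4 = 0.0
    for i in range(n):
        for j in range(i, n):
            s = sum(p * q for p, q in zip(a[i], a[j]))
            m4 += s * s if i == j else 2.0 * s * s
    return m2, m4 / n


if __name__ == "__main__":
    rng = random.Random(1)
    m2, m4 = spectral_moments(wigner_matrix(150, rng))
    # Semicircle predictions are 1 and 2; finite-n values deviate slightly.
    print(m2, m4)
```

Trace moments are used here instead of an eigenvalue histogram only to keep the example free of third-party dependencies.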

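Key Results

The DNN-GP correspondence generalizes to standard architectures and nonlinearities: as width grows, these networks converge to Gaussian process limits (Corollary 2.1).
Under specific conditions, the gradient independence assumption yields correct backpropagation dynamics for polynomially bounded nonlinearities, and explicit corrections are available when it fails (Corollary 2.3).
For standard architectures without batch normalization, the Neural Tangent Kernel at initialization converges to a limit K∞ (Corollary 2.4).
The tensor program framework rederives classical random matrix results and connects to state-evolution-style analyses of related algorithms (e.g., AMP).
The work provides a general method for analyzing signal propagation and gradient dynamics, which can inform the design of initialization schemes that avoid gradient explosion/vanishing.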
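
The convergence of the Neural Tangent Kernel at initialization can be made concrete with a toy case. The sketch below is a hypothetical example, not the general result of Corollary 2.4: it assumes a two-layer ReLU network f(x) = width^(-1/2) * sum_i v_i * relu(w_i . x) in the NTK parameterization with independent standard Gaussian weights, and the names `empirical_ntk`, `limiting_ntk`, and `spread` are invented here. The empirical kernel is the inner product of parameter gradients at two inputs; for this specific network its infinite-width limit has a closed form, and the spread across random initializations should shrink as width grows.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def empirical_ntk(x, y, width, rng):
    """NTK at initialization of f(x) = width**-0.5 * sum_i v_i * relu(w_i . x).

    Assumptions (illustrative, not from the paper): NTK parameterization with
    w_i ~ N(0, I) and v_i ~ N(0, 1), all independent.
    """
    xy = dot(x, y)
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        v = rng.gauss(0.0, 1.0)
        hx, hy = dot(w, x), dot(w, y)
        # Gradients w.r.t. v_i contribute relu(hx) * relu(hy).
        total += max(hx, 0.0) * max(hy, 0.0)
        # Gradients w.r.t. w_i contribute v_i^2 * 1[hx > 0] * 1[hy > 0] * (x . y).
        if hx > 0.0 and hy > 0.0:
            total += v * v * xy
    return total / width


def limiting_ntk(x, y):
    """Infinite-width limit of empirical_ntk for this two-layer ReLU network."""
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    cos_theta = max(-1.0, min(1.0, dot(x, y) / (nx * ny)))
    theta = math.acos(cos_theta)
    value_term = nx * ny * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)
    grad_term = dot(x, y) * (math.pi - theta) / (2.0 * math.pi)
    return value_term + grad_term


def spread(samples):
    """Population standard deviation of a list of floats."""
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    for width in (20, 2000):
        draws = [empirical_ntk(x, y, width, rng) for _ in range(20)]
        # Mean should approach the limit and the spread should shrink with width.
        print(width, sum(draws) / len(draws), spread(draws), limiting_ntk(x, y))
```

The gradients are written out by hand rather than computed with an autodiff library so that the two contributions to the kernel stay visible.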

This review was generated by AI and reviewed by a human editor.