QUICK REVIEW

[Paper Review] Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

Jiaoyang Huang, Horng-Tzer Yau | arXiv (Cornell University) | 2019-09-18

Model Reduction and Neural Networks | References: 34 | Citations: 42

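One-Line Summary

The paper introduces the neural tangent hierarchy (NTH) to describe the finite-width gradient descent dynamics of deep networks, shows that the NTK changes at order 1/m during training (m being the network width), and proposes a truncation that approximates the NTK dynamics to a tunable accuracy.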

ABSTRACT

The evolution of a deep neural network trained by the gradient descent can be described by its neural tangent kernel (NTK) as introduced in [20], where it was proven that in the infinite width limit the NTK converges to an explicit limiting kernel and it stays constant during training. The NTK was also implicit in some other recent papers [6,13,14]. In the overparametrization regime, a fully-trained deep neural network is indeed equivalent to the kernel regression predictor using the limiting NTK. And the gradient descent achieves zero training loss for a deep overparameterized neural network. However, it was observed in [5] that there is a performance gap between the kernel regression using the limiting NTK and the deep neural networks. This performance gap is likely to originate from the change of the NTK along training due to the finite width effect. The change of the NTK along the training is central to describe the generalization features of deep neural networks. In the current paper, we study the dynamic of the NTK for finite width deep fully-connected neural networks. We derive an infinite hierarchy of ordinary differential equations, the neural tangent hierarchy (NTH) which captures the gradient descent dynamic of the deep neural network. Moreover, under certain conditions on the neural network width and the data set dimension, we prove that the truncated hierarchy of NTH approximates the dynamic of the NTK up to arbitrary precision. This description makes it possible to directly study the change of the NTK for deep neural networks, and sheds light on the observation that deep neural networks outperform kernel regressions using the corresponding limiting NTK.
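The link the abstract draws between a constant NTK and kernel regression can be illustrated numerically. The sketch below assumes squared loss normalized by 1/(2n), so that with a kernel K frozen in time the training residual r(t) = f(t) - y obeys the linear ODE dr/dt = -(1/n) K r. The RBF kernel is only a hypothetical stand-in for the limiting NTK, and the function names are illustrative, not taken from the paper.

```python
import numpy as np

def residual_fixed_kernel(K, r0, t):
    """Exact solution of dr/dt = -(1/n) K r for a constant symmetric kernel K."""
    n = K.shape[0]
    evals, evecs = np.linalg.eigh(K)
    # Each eigen-direction of K decays independently at rate evals / n.
    return evecs @ (np.exp(-evals * t / n) * (evecs.T @ r0))

def residual_euler(K, r0, t, steps=10000):
    """Forward-Euler integration of the same ODE, used as a cross-check."""
    n = K.shape[0]
    dt = t / steps
    r = r0.copy()
    for _ in range(steps):
        r = r - dt * (K @ r) / n
    return r

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))                      # toy inputs (assumption)
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dist)                           # RBF stand-in, NOT the NTK
r0 = rng.standard_normal(6)                          # initial residual f(0) - y

r_exact = residual_fixed_kernel(K, r0, t=5.0)
r_euler = residual_euler(K, r0, t=5.0)
print(np.linalg.norm(r0), np.linalg.norm(r_exact))
```

When the smallest eigenvalue of K is positive, every mode of the residual decays exponentially. For finite width the kernel itself moves during training, which is the effect the paper's hierarchy is meant to track.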

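Motivation and Objectives

Motivate and analyze the training dynamics of deep fully-connected networks under gradient flow.
Derive an infinite hierarchy (NTH) that captures the data-dependent, width-sensitive dynamics of the NTK.
Give a priori upper bounds on the higher-order kernels K_t^(r) and show that the change of the NTK is O(1/m), where m is the network width.
Propose a truncated NTH that approximates the NTK dynamics to arbitrary precision at sufficiently large width.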

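Proposed Method

Formalize continuous-time gradient descent (gradient flow) for a deep fully-connected network with H hidden layers and weights W^(l).
Define the neural tangent kernel K_t^(2)(·,·) as a sum of layer-wise kernels G_t^(l) and show its data dependence.
Derive the neural tangent hierarchy (NTH), an infinite system of ordinary differential equations (ODEs) that relates f(t) to the higher-order kernels K_t^(r) for r ≥ 2.
Establish a priori bounds on the higher-order kernels K_t^(r) and prove that the change of K_t^(2) is O(1/m).
Introduce the truncated NTH by setting ∂_t K_t^(p) = 0 and analyze its approximation error.
Give convergence results and the conditions under which gradient descent drives the training loss to zero (at a linear, i.e. exponential, rate under the stated assumptions).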
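The steps above can be written out as a sketch of the hierarchy's general form. The block assumes squared loss normalized by 1/(2n) over n training points (x_α, y_α) and writes f_α(t) for the network output at x_α; the normalization and index conventions are assumptions made for illustration and may differ from the paper's.

```latex
\begin{align}
\partial_t \bigl(f_\alpha(t) - y_\alpha\bigr)
  &= -\frac{1}{n}\sum_{\beta=1}^{n}
     K_t^{(2)}(x_\alpha, x_\beta)\,\bigl(f_\beta(t) - y_\beta\bigr), \\
\partial_t K_t^{(r)}(x_{\alpha_1},\dots,x_{\alpha_r})
  &= -\frac{1}{n}\sum_{\beta=1}^{n}
     K_t^{(r+1)}(x_{\alpha_1},\dots,x_{\alpha_r},x_\beta)\,
     \bigl(f_\beta(t) - y_\beta\bigr),
  \qquad r \ge 2.
\end{align}
% Truncation at level p: keep the equations above for r < p and close the
% system by freezing the top kernel at its initial value:
\begin{equation}
\partial_t \tilde K_t^{(p)} = 0 .
\end{equation}
```

Each equation couples a kernel to the next one in the hierarchy through the residual, so freezing level p turns the infinite system into a finite set of ODEs. With p = 2 this reduces to training under a constant NTK.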

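Results

Research Questions

RQ1: Do the gradient flow dynamics of finite-width deep networks admit an exact infinite hierarchy (NTH) that describes their evolution?
RQ2: How do the higher-order NTK-like kernels K_t^(r) behave, and can they be bounded a priori?
RQ3: Is the change of the NTK during training of order 1/m, and what does this imply for generalization and training dynamics?
RQ4: Can a finite-level truncation of the NTH accurately approximate the NTK dynamics at practical widths, and how does width affect the approximation error?
RQ5: Under what conditions does gradient descent drive the training loss to zero in wide networks, and can these conditions improve on earlier results?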

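Key Findings

The gradient descent dynamics of deep networks can be described by the infinite neural tangent hierarchy (NTH).
There exist deterministic higher-order kernels that admit a priori estimates, and the change of the NTK is O(1/m) under the stated assumptions.
The truncated NTH approximates the NTK dynamics with a controllable error, and this error shrinks as the width grows.
When the width m is at least n^3 (n being the number of training samples), the truncated hierarchy tracks the NTK dynamics well up to a certain time, with error terms that decrease as m grows.
Under a positive smallest-eigenvalue condition on K_0^(2), gradient flow decreases the training error exponentially (a linear convergence rate) for sufficiently wide networks.
Wider networks (larger m) allow a longer valid approximation time and a smaller truncation error, suggesting width and depth benefits for NTK evolution and training performance.
The auxiliary results suggest that the width required for convergence guarantees improves from the earlier quartic bound to a cubic one.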
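The width dependence of the NTK change can be probed with a small experiment. The sketch below is a simplification, not the paper's setting: it uses a two-layer tanh network f(x) = a·tanh(Wx)/sqrt(m) instead of a deep one, discrete gradient descent with a small step instead of gradient flow, and random toy data. All names and hyperparameters are hypothetical. The empirical NTK is assembled as the sum of an output-layer kernel and a hidden-layer kernel, mirroring the layer-wise decomposition mentioned in the method.

```python
import numpy as np

def empirical_ntk(W, a, X):
    """Empirical NTK of f(x) = a . tanh(W x) / sqrt(m) as a sum of per-layer kernels."""
    m = W.shape[0]
    act = np.tanh(X @ W.T)                    # (n, m) hidden activations
    dact_a = (1.0 - act ** 2) * a             # (n, m) entries a_j * tanh'(w_j . x_i)
    k_out = act @ act.T / m                   # gradients with respect to a
    k_hidden = (dact_a @ dact_a.T) * (X @ X.T) / m   # gradients with respect to W
    return k_out + k_hidden

def ntk_relative_change(m, n=8, d=4, steps=300, lr=0.1, seed=0):
    """Train with gradient descent and return ||K_T - K_0||_F / ||K_0||_F."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
    y = rng.standard_normal(n)                       # toy labels (assumption)
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    K0 = empirical_ntk(W, a, X)
    for _ in range(steps):
        act = np.tanh(X @ W.T)
        r = act @ a / np.sqrt(m) - y                 # residual f - y
        # Gradients of the loss (1 / 2n) * sum(r ** 2)
        grad_a = act.T @ r / (n * np.sqrt(m))
        grad_W = ((r[:, None] * (1.0 - act ** 2) * a).T @ X) / (n * np.sqrt(m))
        a = a - lr * grad_a
        W = W - lr * grad_W
    KT = empirical_ntk(W, a, X)
    return np.linalg.norm(KT - K0) / np.linalg.norm(K0)

if __name__ == "__main__":
    for width in (64, 256, 1024, 4096):
        print(width, ntk_relative_change(width))
```

The printed relative change should shrink as the width grows. The exact rate seen in such a toy run depends on the architecture and training time, so it is an illustration of the trend rather than a check of the O(1/m) bound itself.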

This review was generated by AI and reviewed by a human editor.