QUICK REVIEW

[Paper Review] Mean Field Residual Networks: On the Edge of Chaos

Greg Yang, Samuel S. Schoenholz | arXiv (Cornell University) | 2017-12-24

Neural Networks and Applications | References: 8 | Citations: 31

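One-Line Summary

This paper presents a mean field analysis of randomly initialized residual networks. It shows that, thanks to skip connections, the forward and backward dynamics are subexponential (in many cases polynomial), so the networks operate at the edge of chaos. The main contribution is a theoretical and empirical framework for predicting network performance from initialization hyperparameters. It reveals that the optimal variances depend on depth and therefore differ fundamentally from Xavier or He initialization.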

ABSTRACT

We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind.

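Motivation and Objectives

To understand the dynamical behavior of randomly initialized residual networks through mean field theory.
To characterize how skip connections change forward and backward propagation dynamics compared with plain feedforward networks.
To determine how the optimal initialization variance of residual networks depends on depth and nonlinearity.
To establish a predictive link between initialization hyperparameters and test-time performance.
To derive new mathematical identities involving Bessel functions for ReLU-like nonlinearities.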

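Proposed Method

Applies mean field theory to analyze how the cosine distance between input vectors evolves with depth.
Models the dynamics of activation and gradient flow using difference equations and fixed-point analysis.
Derives precise asymptotic expressions for the growth of gradient variance and input distance as functions of network depth and nonlinearity.
Introduces a new framework that predicts test-time performance from initialization-time metrics such as gradient explosion or input distance.
Analyzes α-ReLU nonlinearities using advanced mathematical tools, including integral representations and Bessel functions.
Validates the theoretical predictions through empirical experiments on MNIST with a range of activation functions and hyperparameters.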
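
The forward quantities named in the list above (activation length and cosine similarity between two inputs) can be measured directly on a randomly initialized network. The following is a minimal sketch, not the authors' code. It assumes a simplified residual block of the form x + tanh(Wx + b) with Gaussian weights of variance sigma_w^2 / width; the function names (`propagate`, `mean_square`, `cosine`) and all numeric settings are hypothetical choices for illustration. It uses only the Python standard library.

```python
import math
import random


def mean_square(v):
    """Average squared entry of a vector (the 'length' tracked per layer)."""
    return sum(a * a for a in v) / len(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def propagate(depth, width, sigma_w, sigma_b, residual, seed=0):
    """Push two random inputs through a random tanh network.

    Returns one (mean squared activation, cosine similarity) pair for the
    inputs (index 0) and one for every layer after that.
    ASSUMPTION: the residual block is x + tanh(W x + b); this simplified
    form is chosen for illustration and is not taken from the review.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    y = [rng.gauss(0.0, 1.0) for _ in range(width)]
    stats = [(mean_square(x), cosine(x, y))]
    w_std = sigma_w / math.sqrt(width)
    for _ in range(depth):
        fx, fy = [], []
        for _ in range(width):
            # One neuron: fresh Gaussian weights and bias, shared by both inputs.
            row = [rng.gauss(0.0, w_std) for _ in range(width)]
            bias = rng.gauss(0.0, sigma_b)
            fx.append(math.tanh(sum(w * a for w, a in zip(row, x)) + bias))
            fy.append(math.tanh(sum(w * a for w, a in zip(row, y)) + bias))
        if residual:
            # Skip connection: add the branch output to the incoming signal.
            x = [a + f for a, f in zip(x, fx)]
            y = [a + f for a, f in zip(y, fy)]
        else:
            x, y = fx, fy
        stats.append((mean_square(x), cosine(x, y)))
    return stats


if __name__ == "__main__":
    for residual in (False, True):
        stats = propagate(depth=24, width=128, sigma_w=1.0, sigma_b=0.5,
                          residual=residual, seed=1)
        label = "residual" if residual else "feedforward"
        for layer in (0, 6, 12, 24):
            p, c = stats[layer]
            print(f"{label:12s} layer {layer:2d}  length={p:8.3f}  cosine={c:6.3f}")
```

In the feedforward case the length can never exceed 1, because tanh is bounded. In the residual case each layer can add at most 1 to the root-mean-square activation, so the length is bounded by a quadratic in depth. This simple bound is consistent with the polynomial (rather than exponential) behavior the review attributes to the paper, though the exact exponents come from the paper's analysis and are not reproduced here.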

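Experimental Results

Research Questions

RQ1: How do skip connections in residual networks change forward and backward dynamics compared with plain feedforward networks?
RQ2: What is the asymptotic convergence rate of the cosine distance between input vectors in residual networks?
RQ3: Why do residual networks generalize better than standard networks despite random initialization?
RQ4: How does the optimal initialization variance of residual networks depend on depth and nonlinearity?
RQ5: Can the performance of a trained network be predicted from properties computed at initialization?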

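Key Findings

Residual networks exhibit polynomial rather than exponential convergence, which indicates that they sit at the edge of chaos.
For α-ReLU with α < 1, the gradient variance grows only polynomially with depth, avoiding exponential explosion.
Theoretical predictions of gradient explosion and input distance at initialization accurately predict test-time performance across a range of architectures and hyperparameters.
The optimal initialization variance of residual networks depends on depth and nonlinearity, which departs fundamentally from the assumptions behind Xavier and He initialization.
The paper derives new identities connecting the kernels of powers of ReLU functions to the zeroth Bessel function of the second kind.
Empirical results confirm that gradient explosion determines performance for tanh residual networks, whereas expressivity (input distance) is the main factor for (α-)ReLU networks.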
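
The contrast between polynomial and saturating growth can be illustrated with a small difference-equation calculation. The sketch below rests on stated assumptions and does not reproduce the paper's derivation. It assumes a simplified residual block x + tanh(Wx + b) and the usual mean field approximation, under which the mean squared activation p obeys p_next = p + E[tanh(sqrt(q) z)^2], with q = sigma_w^2 p + sigma_b^2 and z standard normal. Without the skip connection the first p on the right-hand side is dropped. The function names and parameter values are hypothetical.

```python
import math


def tanh_second_moment(q, n=1600, z_max=8.0):
    """Approximate E[tanh(sqrt(q) * z)^2] for z ~ N(0, 1) with a midpoint rule."""
    dz = 2.0 * z_max / n
    s = math.sqrt(q)
    num = 0.0
    den = 0.0
    for k in range(n):
        z = -z_max + (k + 0.5) * dz
        weight = math.exp(-0.5 * z * z)
        num += weight * math.tanh(s * z) ** 2
        den += weight
    # Dividing by den makes this a weighted average, so it stays in [0, 1].
    return num / den


def length_recursion(depth, sigma_w, sigma_b, residual, p0=1.0):
    """Iterate the ASSUMED mean field recursion for the mean squared activation."""
    p = p0
    history = [p]
    for _ in range(depth):
        q = sigma_w ** 2 * p + sigma_b ** 2  # pre-activation variance
        branch = tanh_second_moment(q)
        p = p + branch if residual else branch
        history.append(p)
    return history


def loglog_slope(history, l1, l2):
    """Growth exponent estimated from two depths on a log-log scale."""
    return math.log(history[l2] / history[l1]) / math.log(l2 / l1)


if __name__ == "__main__":
    res = length_recursion(400, sigma_w=1.0, sigma_b=0.5, residual=True)
    ff = length_recursion(400, sigma_w=1.0, sigma_b=0.5, residual=False)
    print("residual length at depth 200 and 400:", res[200], res[400])
    print("estimated growth exponent:", loglog_slope(res, 200, 400))
    print("feedforward length at depth 400:", ff[400])
```

Under these assumptions the residual length keeps growing with an estimated exponent close to 1 (roughly linear in depth), while the feedforward length settles at a fixed point below 1. The exponents for other nonlinearities, and for gradients, are derived in the paper itself and are not covered by this sketch.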


This review was generated by AI and checked by a human editor.