QUICK REVIEW

[Paper Review] Exponentially vanishing sub-optimal local minima in multilayer neural networks

Daniel Soudry, Elad Hoffer | arXiv (Cornell University) | 2017-02-19

Neural Networks and Applications | References: 33 | Citations: 54

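One-line summary

For an MNN with one hidden layer of piecewise linear units trained with MSE, under mild over-parameterization and a Gaussian input assumption, the volume of differentiable regions containing sub-optimal local minima vanishes exponentially relative to the volume of regions containing global minima.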

ABSTRACT

Background: Statistical mechanics results (Dauphin et al. (2014); Choromanska et al. (2015)) suggest that local minima with high error are exponentially rare in high dimensions. However, to prove low error guarantees for Multilayer Neural Networks (MNNs), previous works so far required either a heavily modified MNN model or training method, strong assumptions on the labels (e.g., "near" linear separability), or an unrealistic hidden layer with $\Omega\left(N\right)$ units. Results: We examine a MNN with one hidden layer of piecewise linear units, a single output, and a quadratic loss. We prove that, with high probability in the limit of $N\rightarrow\infty$ datapoints, the volume of differentiable regions of the empiric loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the same volume of global minima, given standard normal input of dimension $d_{0}=\tilde{\Omega}\left(\sqrt{N}\right)$, and a more realistic number of $d_{1}=\tilde{\Omega}\left(N/d_{0}\right)$ hidden units. We demonstrate our results numerically: for example, $0\%$ binary classification training error on CIFAR with only $N/d_{0}\approx 16$ hidden neurons.
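As a rough sanity check on the abstract's $N/d_{0}\approx 16$ figure, the arithmetic can be sketched as below. The values `N = 50_000` and `d0 = 32 * 32 * 3` are assumptions taken from the standard CIFAR-10 training split and image size; the abstract itself states only the ratio, so the experimental setup may differ.

```python
# Assumed values (not stated in the abstract): the standard CIFAR-10 training
# set size and the flattened dimension of a 32x32 RGB image.
N = 50_000
d0 = 32 * 32 * 3

ratio = N / d0                    # the abstract's N / d0
d1 = round(ratio)                 # hidden units at the mild over-parameterization scale
first_layer_weights = d0 * d1     # weights between the input and the hidden layer

print(f"d0 = {d0}, N/d0 = {ratio:.2f}, d1 = {d1}")
print(f"first-layer weights = {first_layer_weights} (compare with N = {N})")
```

Under these assumed values the first-layer weight count is close to the number of datapoints, which is the sense in which $d_{1}=\tilde{\Omega}\left(N/d_{0}\right)$ is a mild degree of over-parameterization.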

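Research motivation and goals

Motivated by the question of why SGD finds solutions with low training error in over-parameterized MNNs.
Presents a realistic MNN setting in which to analyze the landscape of sub-optimal local minima.
Derives a probabilistic bound showing that sub-optimal regions are exponentially rarer than regions containing global minima.
Aims to quantify how much over-parameterization, at practical network sizes, contributes to reducing sub-optimal local minima.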

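Proposed method

Analyzes a 2-layer MNN with a single hidden layer of piecewise linear units and a scalar output.
Uses mean square error (MSE) as the loss function and focuses the analysis on differentiable local minima (DLMs).
Defines differentiable regions, in which the activation pattern is fixed, and relates the residual error e to the rank condition (A ∘ X) e = 0.
Introduces angular volume as a probability measure of parameter regions under random Gaussian initialization.
Proves an upper bound on the angular volume of sub-optimal DLMs and a lower bound on the angular volume of global minima.
Establishes the main theorem bounding the ratio of these volumes, showing that sub-optimal regions vanish exponentially relative to global minima.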
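The condition (A ∘ X) e = 0 can be illustrated with a small NumPy sketch. It assumes that ∘ denotes the column-wise Kronecker (Khatri-Rao) product, that the output weights are folded into the hidden-layer weights W, and that the units are leaky ReLUs with slope 0.1. The names `khatri_rao`, `forward`, and `mse`, and the toy sizes, are hypothetical choices for illustration rather than the authors' code.

```python
import numpy as np

def khatri_rao(A, X):
    """Column-wise Kronecker product: column n equals kron(A[:, n], X[:, n])."""
    d1, N = A.shape
    d0 = X.shape[0]
    return np.einsum("in,jn->ijn", A, X).reshape(d1 * d0, N)

def forward(W, X, leak=0.1):
    """Sum of leaky-ReLU hidden units (assumption: output weights folded into W)."""
    pre = W @ X                          # pre-activations, shape (d1, N)
    A = np.where(pre > 0, 1.0, leak)     # activation pattern: slope of each unit per sample
    return (A * pre).sum(axis=0), A

def mse(W, X, y):
    out, _ = forward(W, X)
    return np.mean((y - out) ** 2)

rng = np.random.default_rng(0)
d0, d1, N = 8, 6, 30                     # toy sizes with d0 * d1 > N
X = rng.standard_normal((d0, N))         # standard normal inputs
y = rng.choice([-1.0, 1.0], size=N)      # arbitrary binary labels
W = rng.standard_normal((d1, d0))        # Gaussian draw of the weights

out, A = forward(W, X)
e = y - out                              # residual error vector
G = khatri_rao(A, X)                     # (d0 * d1, N)

# Inside a differentiable region A is constant, so the output is linear in vec(W)
# and the MSE gradient is -(2/N) G e. A differentiable critical point needs G e = 0.
grad = (-2.0 / N) * (G @ e).reshape(d1, d0)

# If rank(G) == N, then G e = 0 has only the solution e = 0 (zero training error).
print("rank(G) =", np.linalg.matrix_rank(G), "with N =", N)
```

The point of the sketch is the last comment: whenever G has full column rank N inside a region, any differentiable critical point in that region must have zero residual, so sub-optimal DLMs can only sit in regions where G is rank deficient.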

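Experimental results

Research questions

RQ1 Under what conditions do sub-optimal differentiable local minima become exponentially rare in high dimensions?
RQ2 In terms of hidden-layer width and input dimension, how does over-parameterization affect the volume of sub-optimal regions and of regions containing global minima?
RQ3 Under practical assumptions (Gaussian input, mild over-parameterization), can a low training error guarantee be obtained theoretically without modifying the MNN or changing the training method?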

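Main results

Under the stated assumptions, the expected angular volume of sub-optimal DLMs with MCE > ε becomes exponentially small in N.
Global minima exist with high probability and have non-negligible angular volume, which makes the comparison with sub-optimal regions meaningful.
The ratio $V(L_{\epsilon})/V(G)$ is upper-bounded by $\exp\left(-\gamma_{\epsilon}N^{3/4}(d_{1}d_{0})^{1/4}\right)$, which is in turn at most $\exp\left(-\gamma_{\epsilon}N\log N\right)$, so sub-optimal regions are exponentially rare.
Numerical experiments on Gaussian data and on real datasets (MNIST, CIFAR, ImageNet) reach near-zero training error with a relatively small number of parameters (on the order of N), consistent with the theory.
Non-differentiable critical points appear rarely in the numerical experiments, and the main results focus on differentiable local minima.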
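The step from the first exponent to the second in the ratio bound can be made explicit. The sketch below assumes that the over-parameterization condition takes the form $d_{1}d_{0}\ge N\log^{4}N$, which is one concrete reading of the $\tilde{\Omega}$ notation in the abstract and not a statement quoted from this review. It also assumes $\gamma_{\epsilon}>0$.

$$
N^{3/4}\left(d_{1}d_{0}\right)^{1/4}\;\ge\;N^{3/4}\left(N\log^{4}N\right)^{1/4}\;=\;N\log N
\quad\Longrightarrow\quad
\exp\left(-\gamma_{\epsilon}N^{3/4}\left(d_{1}d_{0}\right)^{1/4}\right)\;\le\;\exp\left(-\gamma_{\epsilon}N\log N\right)
$$

Under that reading, the second bound is simply the weakest case of the first, and widening the hidden layer beyond the minimum only makes the ratio smaller.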


This review was generated by AI and reviewed by a human editor.