QUICK REVIEW

[Paper Review] Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Levent Sagun, Utku Evci | arXiv (Cornell University) | 2017-06-14

Stochastic Gradient Optimization Techniques | References: 28 | Citations: 168

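ONE-LINE SUMMARY

This paper studies the Hessian spectrum of over-parametrized neural networks, showing that it consists of a small number of data-dependent outliers and a bulk of eigenvalues close to zero, and relates this structure to over-parametrization, flatness, and basins of attraction in high-dimensional non-convex optimization.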

ABSTRACT

We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, (2) and outliers away from the bulk. We present numerical evidence and mathematical justifications to the following conjectures laid out by Sagun et al. (2016): Fixing data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading. And that the discussion of wide/narrow basins may be in need of a new perspective around over-parametrization and redundancy that are able to create large connected components at the bottom of the landscape. Second, the dependence of small number of large eigenvalues to the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that it would shed light into the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small and large batch gradient descent appear to converge to different basins of attraction but we show that they are in fact connected through their flat region and so belong to the same basin.

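MOTIVATION AND OBJECTIVES

Understand the geometry of deep learning loss surfaces through second-order (Hessian) analysis.
Characterize the spectrum of the Hessian and its decomposition into interpretable components.
Investigate how data complexity, model size, and the optimization algorithm affect the Hessian eigenvalues.
Draw out implications for basins of attraction, flatness, and generalization in high-dimensional non-convex optimization.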

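PROPOSED METHOD

Compute the exact Hessian using Hessian-vector products, both at a random initial point and after training.
Use the generalized Gauss-Newton decomposition to express the Hessian as a covariance-like term plus a second term involving second derivatives (Equation 4).
Show that near a local minimum the Hessian is dominated by a term of rank at most N, so that a large number of eigenvalues are close to zero (Equation 5).
Vary data complexity by generating multi-cluster Gaussian datasets and training with SGD, observing that the number of outlier eigenvalues reflects the number of classes.
Examine the effect of over-parametrization by increasing only the network size while keeping the data fixed, and observe the change (or lack of change) in the large eigenvalues of the spectrum.
Train with small and large batches and compare the effect on the outliers of the Hessian spectrum.
Check whether negative eigenvalues exist at the bottom of the spectrum and how they scale with model size.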
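
The decomposition and the rank argument referred to as Equations 4 and 5 can be sketched as follows. The notation is an assumption made here for illustration (a scalar model output f_i(w) on example i, a convex per-example loss ℓ, N training examples, M parameters); the paper's exact symbols and numbering may differ.

```latex
% Assumed setup: L(w) = (1/N) \sum_i \ell(f_i(w)), with w \in \mathbb{R}^M.
\begin{align}
\nabla^2 L(w)
  &= \frac{1}{N}\sum_{i=1}^{N} \ell''\big(f_i(w)\big)\,
       \nabla f_i(w)\,\nabla f_i(w)^{\top}
   + \frac{1}{N}\sum_{i=1}^{N} \ell'\big(f_i(w)\big)\,\nabla^2 f_i(w) \\
% Near a minimum where \ell'(f_i(w)) \approx 0 for every i,
% the second sum becomes negligible:
\nabla^2 L(w)
  &\approx \frac{1}{N}\sum_{i=1}^{N} \ell''\big(f_i(w)\big)\,
       \nabla f_i(w)\,\nabla f_i(w)^{\top}
\end{align}
% The right-hand side is a sum of N rank-one matrices, so its rank is at most N.
% When M > N, at least M - N of its eigenvalues are therefore zero.
```

Under these assumptions, the first sum is the covariance-like term built from gradients of the model outputs, and the second sum is the only place where second derivatives of the model appear.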

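EXPERIMENTAL RESULTS

Research Questions

RQ1: How does the Hessian spectrum of an over-parametrized neural network decompose into a bulk and outliers, and what governs each part?
RQ2: How do data complexity, model size, and the optimization algorithm affect the large eigenvalues and the overall geometry of the Hessian?
RQ3: Do small-batch and large-batch optimizers end up in different basins, or in the same flat region?
RQ4: What role does the generalized Gauss-Newton decomposition play in understanding the spectrum and the flatness of the loss surface?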

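Key Findings

The Hessian spectrum separates into a bulk concentrated near zero and a few outliers away from the bulk.
Increasing the model size with the data held fixed does not change the number of large eigenvalues, which supports the claim that the bulk is merely rescaled while the outliers are determined by the data.
As the data become more complex (more clusters), the number of outliers increases; in some experiments it is roughly proportional to the number of classes.
Large-batch methods tend to produce larger outlier eigenvalues than small-batch methods, suggesting that the local curvature differs along certain directions.
Negative eigenvalues exist at the lower end of the spectrum, but they are small in magnitude compared with the positive outliers, suggesting that some negative curvature remains even near the end of training.
The two solutions found by large-batch and small-batch methods can lie in the same wide basin, connected through a flat region, which challenges the notion of isolated basins.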
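
The bulk/outlier split in the findings above can be illustrated on a toy problem. The sketch below is not the authors' experiment: it assumes a linear least-squares model instead of a neural network, and it uses a central finite difference of the gradient as a stand-in for an autodiff Hessian-vector product. All function names, sizes, and the 1e-6 threshold are hypothetical choices made for this example.

```python
import numpy as np

def loss_grad(w, X, y):
    """Gradient of the mean squared error (1/2N) * ||Xw - y||^2 with respect to w."""
    n = X.shape[0]
    return X.T @ (X @ w - y) / n

def hvp(w, v, X, y, eps=1e-4):
    """Hessian-vector product H @ v via a central finite difference of the gradient.

    Assumption: this replaces an autodiff-based product; it is exact up to
    round-off here only because the loss is quadratic in w.
    """
    return (loss_grad(w + eps * v, X, y) - loss_grad(w - eps * v, X, y)) / (2 * eps)

def full_hessian(w, X, y):
    """Assemble the Hessian column by column from one product per basis vector."""
    basis = np.eye(w.size)
    H = np.stack([hvp(w, e, X, y) for e in basis], axis=1)
    return 0.5 * (H + H.T)  # symmetrize to remove round-off asymmetry

rng = np.random.default_rng(0)
n_samples, n_params = 5, 20  # more parameters than examples (over-parametrized)
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)
w = rng.normal(size=n_params)

H = full_hessian(w, X, y)
eigvals = np.linalg.eigvalsh(H)  # real eigenvalues in ascending order

tol = 1e-6  # hypothetical cutoff separating "near zero" from "large"
n_outliers = int(np.sum(eigvals > tol))
n_bulk = int(np.sum(np.abs(eigvals) <= tol))
print(f"near-zero eigenvalues: {n_bulk}, large eigenvalues: {n_outliers}")
```

In this setting the Hessian equals X^T X / N, so its rank is bounded by the number of examples: at most 5 eigenvalues can be nonzero and at least 15 must be (numerically) zero. Adding parameters while keeping the 5 examples fixed only enlarges the near-zero group, which mirrors the over-parametrization finding in a much simpler model.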


This review was generated by AI and reviewed by a human editor.