QUICK REVIEW

[Paper Review] Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee | arXiv (Cornell University) | October 12, 2018

Stochastic Gradient Optimization Techniques | References: 78 | Citations: 46

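One-Line Summary

This paper shows that, with explicit L2 regularization, a neural net generalizes better and learns a constructed distribution from O(d) samples, whereas the NTK-based kernel method requires Omega(d^2) samples on the same distribution. It also proves that, in the infinite-width limit, noisy gradient descent on the regularized loss converges to a global minimum in polynomially many iterations.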

ABSTRACT

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard $\ell_2$ regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in d dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $Ω(d^2)$ samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new technique for proving lower bounds for kernel methods, which relies on showing that the kernel cannot focus on informative features. Motivated by our generalization results, we study whether the regularized global optimum is attainable. We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.
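
For reference, the max-margin statement in (i) can be written roughly as follows. This is a sketch in notation chosen for this review, not taken from the abstract: $f(\Theta; x)$ is the network output (assumed positively homogeneous in $\Theta$, as for ReLU nets without biases), $\ell$ is the logistic loss, $\lambda$ is the regularization strength, $r$ is an assumed norm exponent, and $\gamma^\star$ is the best achievable normalized margin.

```latex
% Weakly regularized objective and its minimizer
L_\lambda(\Theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i f(\Theta; x_i)\big) + \lambda \|\Theta\|_2^{r},
\qquad
\Theta_\lambda \in \arg\min_{\Theta} L_\lambda(\Theta)

% Maximum normalized margin over unit-norm parameters
\gamma^\star = \max_{\|\Theta\|_2 \le 1} \; \min_{1 \le i \le n} \; y_i f(\Theta; x_i)

% Claim (i): the normalized margin of the regularized minimizer approaches the optimum
\min_{1 \le i \le n} y_i f\!\left(\frac{\Theta_\lambda}{\|\Theta_\lambda\|_2}; x_i\right)
\;\longrightarrow\; \gamma^\star
\quad \text{as } \lambda \to 0
```

Read this way, "weakly regularized" means the limit of vanishing $\lambda$: the penalty never dominates the loss, but it is enough to select the max-margin direction among all parameter settings that fit the data.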

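Research Motivation and Goals

(1) Examine how over-parametrization and explicit regularization affect generalization beyond what the NTK analysis captures.
(2) Exhibit a concrete data distribution that a regularized net learns with O(d) samples while the NTK requires Omega(d^2) samples.
(3) Develop theoretical tools that link weak regularization to the max-margin solution, and prove margin-based generalization bounds.
(4) Show that an infinite-width regularized network can be optimized to a global minimum in polynomial time via a perturbed Wasserstein gradient flow.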

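Proposed Method

Construct a distribution D in d dimensions whose signal is concentrated in the first two coordinates.
Compare a two-layer ReLU network trained with L2-regularized logistic loss against the NTK induced by the same architecture.
Prove that the global optimum of the weakly regularized logistic loss attains the maximum normalized margin among all networks of the same architecture.
Introduce a perturbed Wasserstein gradient flow and prove polynomial-time convergence to a global minimum for infinite-width networks.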
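
The setup above can be made concrete with a small NumPy sketch. Several details are assumptions for illustration and are not stated in this review: inputs are drawn uniformly from {-1, +1}^d, the label is y = x_1 * x_2 (one way to put the signal in the first two coordinates), the network has no bias terms, and `xor_network` is a hypothetical hand-built four-neuron network that fits this labeling.

```python
import numpy as np

def sample_toy_distribution(n, d, rng):
    """Draw n points uniformly from {-1, +1}^d.

    Assumption: the label is y = x_1 * x_2, so only the first two
    coordinates carry signal and the remaining d - 2 are noise.
    """
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = X[:, 0] * X[:, 1]
    return X, y

def two_layer_relu(W, a, X):
    """f(x) = sum_j a_j * relu(w_j . x), with no bias terms."""
    return np.maximum(X @ W.T, 0.0) @ a

def regularized_logistic_loss(W, a, X, y, lam):
    """Mean logistic loss plus an L2 penalty on all weights."""
    margins = y * two_layer_relu(W, a, X)
    data_term = np.mean(np.logaddexp(0.0, -margins))
    penalty = lam * (np.sum(W ** 2) + np.sum(a ** 2))
    return data_term + penalty

def normalized_margin(W, a, X, y):
    """Smallest margin after rescaling all parameters to unit L2 norm.

    The two-layer net is 2-homogeneous in (W, a), so dividing the
    parameters by their norm divides the output by norm ** 2.
    """
    norm_sq = np.sum(W ** 2) + np.sum(a ** 2)
    return np.min(y * two_layer_relu(W, a, X)) / norm_sq

def xor_network(d):
    """Hypothetical 4-neuron net computing |x1 + x2| - |x1 - x2|."""
    W = np.zeros((4, d))
    W[0, :2] = [1.0, 1.0]
    W[1, :2] = [-1.0, -1.0]
    W[2, :2] = [1.0, -1.0]
    W[3, :2] = [-1.0, 1.0]
    a = np.array([1.0, 1.0, -1.0, -1.0])
    return W, a

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = sample_toy_distribution(200, 20, rng)
    W, a = xor_network(20)
    print("normalized margin:", normalized_margin(W, a, X, y))
    print("regularized loss:", regularized_logistic_loss(W, a, X, y, lam=1e-3))
```

The hand-built network ignores every coordinate except the first two, which is the kind of feature selection a fixed kernel cannot perform; its unnormalized margin is 2 on every sample regardless of d.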

Experimental Results

Research Questions

RQ1: Can explicit L2 regularization enable neural nets to achieve better margins and generalization than the NTK kernel?
RQ2: What is the sample complexity gap between regularized neural nets and NTK-based methods on a constructed data distribution?
RQ3: Is the regularized global optimum attainable via efficient optimization in the infinite-width limit?
RQ4: Does weak regularization push the optimizer toward max-margin solutions across deep architectures?

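Key Results

On the constructed distribution, the regularized neural net generalizes well with O(d) samples, whereas the NTK requires Omega(d^2) samples.
The global optimum of the weakly regularized logistic loss attains the maximum normalized margin among networks of the same architecture.
Over-parametrization in width helps: the maximum attainable margin is non-decreasing in the network width, which in turn improves the margin-based generalization bound.
For infinite-width two-layer networks, noisy gradient descent optimizes the regularized loss to a global minimum in polynomially many iterations.
Simulations support the theory: explicit regularization improves both margin and test accuracy relative to training without it.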
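
As a rough illustration of the optimization result, the sketch below runs noisy gradient descent on the L2-regularized logistic loss of a finite-width two-layer ReLU net. It is not the paper's algorithm: the finite width, the Gaussian form of the noise, the initialization scales, and all hyperparameter defaults are assumptions made here, and the function names are hypothetical. The theorem concerns the infinite-width limit, which a finite network only approximates.

```python
import numpy as np

def objective(W, a, X, y, lam):
    """Mean logistic loss of f(x) = sum_j a_j relu(w_j . x) plus L2 penalty."""
    margins = y * (np.maximum(X @ W.T, 0.0) @ a)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * (np.sum(W ** 2) + np.sum(a ** 2))

def noisy_gradient_descent(X, y, width=64, lam=1e-3, lr=0.05,
                           noise_std=0.01, steps=300, seed=0):
    """Gradient descent on the regularized loss with added Gaussian noise.

    Assumption: isotropic Gaussian noise is added to every parameter after
    each step; this is a stand-in and may differ from the paper's scheme.
    Returns the final weights and the objective value before each step
    plus the final value (length steps + 1).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(width, d))
    a = rng.normal(scale=1.0 / np.sqrt(width), size=width)
    history = []
    for _ in range(steps):
        history.append(objective(W, a, X, y, lam))
        pre = X @ W.T                      # pre-activations, shape (n, width)
        act = np.maximum(pre, 0.0)         # ReLU features
        margins = y * (act @ a)
        # Derivative of the mean logistic loss w.r.t. each network output:
        # -y * sigmoid(-margin) / n, written with tanh for stability.
        g_out = -y * 0.5 * (1.0 - np.tanh(0.5 * margins)) / n
        grad_a = act.T @ g_out + 2.0 * lam * a
        grad_W = (g_out[:, None] * (pre > 0.0) * a[None, :]).T @ X + 2.0 * lam * W
        W = W - lr * grad_W + noise_std * rng.normal(size=W.shape)
        a = a - lr * grad_a + noise_std * rng.normal(size=a.shape)
    history.append(objective(W, a, X, y, lam))
    return W, a, np.array(history)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.choice([-1.0, 1.0], size=(64, 5))
    y = X[:, 0] * X[:, 1]   # assumed labeling: signal in the first two coordinates
    _, _, hist = noisy_gradient_descent(X, y)
    print("objective at start and end:", hist[0], hist[-1])
```

Setting `noise_std=0.0` recovers plain gradient descent, which makes it easy to compare the two trajectories on the same data and seed.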


This review was generated by AI and checked by a human editor.