QUICK REVIEW

[Paper Review] Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

Lénaïc Chizat, Francis Bach|arXiv (Cornell University)|2020. 02. 11.

Stochastic Gradient Optimization Techniques|Citations: 35

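One-line summary

This paper analyzes the implicit bias of gradient flow when infinitely wide two-layer networks are trained with exponential-tailed losses, showing that the dynamics converge to a max-margin classifier in a variation-norm space and that generalization is independent of the ambient dimension under hidden low-dimensional structure; it also contrasts training both layers with training only the output layer, and supports the analysis with numerical experiments.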

ABSTRACT

Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.

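Motivation and objectives

Advance the understanding of why over-parameterized networks trained with gradient methods generalize well.
Characterize the limit of the training dynamics for infinitely wide two-layer networks whose units are 2-homogeneous in their weights (for example, ReLU units with both layers trained).
Show that the learned classifier solves a convex max-margin problem in a functional norm.
Compare the implicit bias that results from training both layers with the one that results from training only the output layer.
Provide generalization bounds that depend on hidden low-dimensional structure rather than on the ambient dimension.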

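Proposed method

Model a 2-homogeneous two-layer network with a balanced feature function φ and study the limit m→∞, where m is the number of hidden units.
Represent the predictor as h(μ, x) = ∫ φ(w, x) dμ(w) and map the measure μ onto the sphere through the 2-homogeneous projection Π2(μ).
Define two notions of max margin, one for the variation norm (F1) and one for the RKHS norm (F2), which give the margins γ1 and γ2.
Analyze the gradient flow of the smooth-margin objective S(ĥ(μ)) and interpret it as a Wasserstein gradient flow over probability measures.
Prove that, under suitable conditions, the limit direction ν̄∞ solves the F1-max-margin problem, Eq. (4).
Discuss special cases, including training only the output layer, and provide insight into convergence.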
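
The two margins and the smooth-margin objective named above can be written out as follows. This is a sketch in notation assumed for this review: the sign convention for ĥ and the normalization of S are assumptions and may differ from the paper's exact definitions, and n denotes the number of training pairs (x_i, y_i) with y_i ∈ {−1, +1}.

```latex
% Max-margin problems over the unit balls of the two function spaces
\gamma_1 = \max_{\|f\|_{\mathcal{F}_1} \le 1} \; \min_{1 \le i \le n} y_i f(x_i),
\qquad
\gamma_2 = \max_{\|f\|_{\mathcal{F}_2} \le 1} \; \min_{1 \le i \le n} y_i f(x_i).

% Smooth margin (a soft minimum of the per-sample margins), assumed convention
\hat h(\mu) = \big( y_i \, h(\mu, x_i) \big)_{i=1}^{n},
\qquad
S(u) = -\log\Big( \frac{1}{n} \sum_{i=1}^{n} e^{-u_i} \Big).
```

Since S(u) lies between min_i u_i and min_i u_i + log n, maximizing S is a smooth surrogate for maximizing the worst-case margin.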

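Experimental results

Research questions

RQ1: What is the implicit bias of gradient flow when infinitely wide two-layer networks are trained with the logistic loss or other exponential-tailed losses?
RQ2: Do the training dynamics converge to a max-margin classifier in a functional norm, and how does this convergence depend on the parameterization?
RQ3: How does training only the output layer, or fixing the neuron directions, affect the implicit bias and the convergence behavior?
RQ4: What does the implicit max margin imply for statistical generalization when hidden low-dimensional structure is present?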

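Key results

For wide two-layer networks trained with exponential-tailed losses, the limit of the gradient flow yields a max-margin classifier in the variation-norm space (F1).
When the neuron directions are fixed or only the output layer is trained, the dynamics map to online mirror ascent on the smooth-margin objective, which leads to margin maximization.
γ1 can remain large even in high dimension when hidden low-dimensional structure is present, which allows dimension-independent generalization bounds.
Experiments with two-layer ReLU networks support the predicted implicit bias and its statistical benefits in high dimension.
Training both layers produces a non-smooth F1-max-margin classifier, whereas training only the output layer produces a smooth, RKHS-like F2-max-margin classifier, so the two decision boundaries differ qualitatively.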
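
The contrast between training both layers and training only the output layer can be explored with a small finite-width experiment. The following is a minimal NumPy sketch, not the authors' code: the toy dataset, the width `m`, the step size, the mean-field scaling of the gradients, and all function names are assumptions chosen for illustration. It runs gradient ascent on the smooth margin S(u) = −log((1/n) Σ exp(−u_i)) as a discretized stand-in for the gradient flow.

```python
import numpy as np

def predict(a, b, X):
    """Mean-field two-layer ReLU net: f(x) = (1/m) * sum_j b_j * relu(a_j . x)."""
    return np.maximum(X @ a.T, 0.0) @ b / a.shape[0]

def smooth_margin(u):
    """S(u) = -log(mean(exp(-u))), computed with a shift for numerical stability."""
    shift = np.min(u)
    return shift - np.log(np.mean(np.exp(-(u - shift))))

def normalized_margin(a, b, X, y):
    """Worst-case margin divided by a 2-homogeneous scale (assumed normalization)."""
    scale = np.mean(np.sum(a ** 2, axis=1) + b ** 2) / 2.0
    return np.min(y * predict(a, b, X)) / scale

def train(a0, b0, X, y, train_hidden, lr=0.02, steps=200):
    """Gradient ascent on the smooth margin; hidden weights move only if train_hidden."""
    a, b = a0.copy(), b0.copy()
    for _ in range(steps):
        z = X @ a.T                            # pre-activations, shape (n, m)
        act = np.maximum(z, 0.0)
        u = y * (act @ b) / a.shape[0]         # per-sample margins
        p = np.exp(-(u - u.min()))
        p /= p.sum()                           # dS/du_i: softmax weight on small margins
        g = p * y
        grad_b = act.T @ g                     # per-neuron gradient (1/m factor dropped)
        if train_hidden:
            grad_a = ((z > 0) * g[:, None]).T @ X * b[:, None]
            a = a + lr * grad_a
        b = b + lr * grad_b
    return a, b

# Toy data (hypothetical): six unit-norm points, separable by the sign of x[0].
X = np.array([[1.0, 0.0], [0.6, 0.8], [0.6, -0.8],
              [-1.0, 0.0], [-0.6, 0.8], [-0.6, -0.8]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

rng = np.random.default_rng(0)
m = 50
a0 = rng.normal(size=(m, 2))                   # hidden-layer weights
b0 = rng.choice([-1.0, 1.0], size=m)           # output-layer weights

a_both, b_both = train(a0, b0, X, y, train_hidden=True)
a_out, b_out = train(a0, b0, X, y, train_hidden=False)

print("normalized margin, both layers:", normalized_margin(a_both, b_both, X, y))
print("normalized margin, output only:", normalized_margin(a_out, b_out, X, y))
```

With `train_hidden=False` the features relu(a_j · x) stay fixed, so the model is linear in `b` and behaves like a random-features approximation of a kernel method; with `train_hidden=True` the neuron directions also move. The normalized margin printed here is only a finite-width diagnostic and should not be read as γ1 or γ2 themselves.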

This review was generated by AI and reviewed by a human editor.