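[Paper Review] How to Start Training: The Effect of Initialization and Architecture
This paper rigorously analyzes two early-training failure modes of ReLU networks and shows that appropriate initialization and architecture, particularly for ResNets, prevent these failures and make it possible to train deeper networks. It provides theoretical results and empirical validation across fully connected, convolutional, and residual architectures.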
We identify and study two common failure modes for early training in deep ReLU nets. For each we give a rigorous proof of when it occurs and how to avoid it, for fully connected and residual architectures. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly weighting the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided. In contrast, for fully connected nets, we prove that this failure mode can happen and is avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be trained.
Motivation and Objectives
- Identify failure modes that block early training in deep ReLU networks (FM1 and FM2).
- Provide rigorous conditions on initialization and architecture to avoid FM1 and FM2 in FC, Conv, and ResNet architectures.
- Demonstrate empirically how correct initialization and architecture predict training feasibility and depth.
- Compare behavior across fully connected, convolutional, and residual networks to explain empirical training success of ResNets.
Proposed Method
- Define and analyze two failure modes: FM1 (mean activation length grows/shrinks exponentially with depth) and FM2 (variance of activation lengths across layers grows exponentially).
- Prove that FM1 can be avoided by initializing weights with symmetric distributions of variance 2/fan-in (and scaling residual modules in ResNets).
- Show that FM2 is avoided in ResNets once FM1 is avoided, while in fully connected nets FM2 depends on architecture via the sum of reciprocals of layer widths.
- Derive and state formal theorems (Theorem 1–Theorem 6) describing conditions under which FM1 and FM2 occur or are prevented in FC, Conv, and ResNet architectures.
- Extend results to convolutional architectures by replacing fan-in with the appropriate fan-in for conv layers and demonstrating similar behavior empirically.
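The FM1 condition above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a bias-free fully connected ReLU net with Gaussian weights, and the function name `mean_length_ratio` and the parameter `weight_var_scale` are hypothetical. It tracks the normalized squared activation length ||x||^2 / width from input to last layer. Under these assumptions the expected ratio is (c/2)^depth for weight variance c/fan-in, so only c = 2 keeps it near 1.

```python
import numpy as np


def mean_length_ratio(depth, width, weight_var_scale, trials=32, seed=0):
    """Average ratio of the normalized squared activation length at the last
    layer to that of the input, over independent random initializations."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _trial in range(trials):
        x = rng.standard_normal(width)
        m_in = np.mean(x ** 2)  # normalized squared length ||x||^2 / width
        for _layer in range(depth):
            fan_in = x.shape[0]
            # Zero-mean Gaussian weights with variance weight_var_scale / fan_in.
            std = np.sqrt(weight_var_scale / fan_in)
            w = rng.standard_normal((width, fan_in)) * std
            x = np.maximum(w @ x, 0.0)  # ReLU with biases set to zero
        ratios.append(np.mean(x ** 2) / m_in)
    return float(np.mean(ratios))


if __name__ == "__main__":
    for scale in (1.0, 2.0, 3.0):
        r = mean_length_ratio(depth=10, width=256, weight_var_scale=scale)
        print(f"variance {scale:.0f}/fan-in -> mean length ratio {r:.4g}")
```

With variance 1/fan-in the ratio should collapse toward zero, and with 3/fan-in it should blow up, which is the exploding/vanishing behavior FM1 describes. Averaging over several trials matters because a single draw at finite width fluctuates noticeably.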
Experimental Results
Research Questions
- RQ1: Under what initialization and architectural conditions do FM1 and FM2 occur in deep ReLU networks?
- RQ2: How do FC, Conv, and ResNet architectures differ in their propensity for FM2, and how does this relate to training feasibility of deep networks?
- RQ3: Can proper scaling of residual modules and weight variance enable training of significantly deeper ResNets?
- RQ4: Do empirical activation lengths at initialization reliably predict early training performance across architectures?
Key Results
- Initializing weights from a symmetric distribution with variance 2/fan-in prevents exploding/vanishing mean activation length (FM1) in FC and Conv nets.
- Correctly scaling residual modules in ResNets prevents FM1, and FM2 cannot occur in ResNets once FM1 is avoided (Corollary/Theorem 6).
- For fully connected and convolutional nets, FM2 depends on architecture and is mitigated by wider layers: when all layers share one width, that width must grow roughly linearly with depth to keep the sum of reciprocal widths bounded and avoid FM2.
- For residual networks, FM2 is largely architecture-independent once FM1 is avoided; residual modules weighted appropriately ensure stable activation lengths across depth.
- Empirically, networks initialized with correct variance and architecture exhibit successful training for greater depths, while popular initializations often fail FM1.
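The architectural quantity behind the FM2 result for fully connected nets, the sum of the reciprocals of the layer widths, is easy to compute for a candidate architecture. The sketch below is illustrative only; the helper name `sum_reciprocal_widths` and the example widths are assumptions, not values from the paper.

```python
def sum_reciprocal_widths(hidden_widths):
    """Return the sum of 1/n_j over the hidden layer widths n_j."""
    if any(n <= 0 for n in hidden_widths):
        raise ValueError("layer widths must be positive")
    return sum(1.0 / n for n in hidden_widths)


if __name__ == "__main__":
    for depth in (10, 50, 200):
        fixed = [100] * depth          # width held fixed as depth grows
        scaled = [10 * depth] * depth  # width grows linearly with depth
        print(
            f"depth {depth:>3}: fixed width -> {sum_reciprocal_widths(fixed):.2f}, "
            f"scaled width -> {sum_reciprocal_widths(scaled):.2f}"
        )
```

For a fixed width the sum equals depth / width and grows with depth, while scaling the width in proportion to depth holds it constant. A single narrow layer also dominates the sum, so one bottleneck can outweigh many wide layers.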
This review was generated by AI and reviewed by a human editor.