QUICK REVIEW

[Paper Review] LDLT L-Lipschitz Network Weight Parameterization Initialization

Marius Juston, R.S. Sreenivas|arXiv (Cornell University)|2026. 01. 13.

Stochastic Gradient Optimization Techniques | Citations: 0

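One-Line Summary

This paper uses the Wishart distribution and zonal polynomials to derive the exact marginal output variance of LDLT-based ℒ-Lipschitz layers under Gaussian initialization, and discusses how the initialization parameters affect variance preservation and training dynamics.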

ABSTRACT

We analyze initialization dynamics for LDLT-based $\mathcal{L}$-Lipschitz layers by deriving the exact marginal output variance when the underlying parameter matrix $W_0\in \mathbb{R}^{m \times n}$ is initialized with IID Gaussian entries $\mathcal{N}(0,\sigma^2)$. The Wishart distribution, $S=W_0W_0^\top\sim\mathcal{W}_m(n,\sigma^2 \boldsymbol{I}_m)$, used for computing the output marginal variance is derived in closed form using expectations of zonal polynomials via James' theorem and a Laplace-integral expansion of $(\alpha\boldsymbol{I}_m+S)^{-1}$. We develop an Isserlis/Wick-based combinatorial expansion for $\operatorname{\mathbb{E}}\left[\operatorname{tr}(S^k)\right]$ and provide explicit truncated moments up to $k=10$, which yield accurate series approximations for small-to-moderate $\sigma^2$. Monte Carlo experiments confirm the theoretical estimates. Furthermore, empirical analysis was performed to quantify that, using current He or Kaiming initialization with scaling $1/\sqrt{n}$, the output variance is $0.41$, whereas the new parameterization with $10/\sqrt{n}$ for $\alpha=1$ results in an output variance of $0.9$. The findings clarify why deep $\mathcal{L}$-Lipschitz networks suffer rapid information loss at initialization and offer practical prescriptions for choosing initialization hyperparameters to mitigate this effect. However, using the Higgs boson classification dataset, a hyperparameter sweep over optimizers, initialization scale, and depth was conducted to validate the results on real-world data, showing that although the derivation ensures variance preservation, empirical results indicate He initialization still performs better.
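
The Laplace-integral expansion named in the abstract can be illustrated with a standard identity. The block below is a sketch, not the paper's own derivation: it assumes $\alpha > 0$, treats the termwise expansion as formal (the Wishart distribution has unbounded support, so the series is best read as an approximation for small $\sigma^2$), and uses only the first two real-Wishart trace moments, which are textbook results.

```latex
% Laplace representation of the resolvent (valid for \alpha > 0, S \succeq 0)
\[
  (\alpha I_m + S)^{-1} \;=\; \int_0^\infty e^{-\alpha t}\, e^{-tS}\, dt .
\]
% Expanding e^{-tS} and using \int_0^\infty t^k e^{-\alpha t}/k!\, dt = \alpha^{-(k+1)}
% gives the formal moment series
\[
  \mathbb{E}\!\left[\operatorname{tr}\,(\alpha I_m + S)^{-1}\right]
  \;\approx\; \sum_{k=0}^{K} \frac{(-1)^k}{\alpha^{k+1}}\,
  \mathbb{E}\!\left[\operatorname{tr}(S^k)\right],
\]
% with the first moments for S \sim \mathcal{W}_m(n, \sigma^2 I_m):
\[
  \mathbb{E}[\operatorname{tr}(S^0)] = m, \qquad
  \mathbb{E}[\operatorname{tr}(S)] = m n \sigma^2, \qquad
  \mathbb{E}[\operatorname{tr}(S^2)] = m n (m + n + 1)\, \sigma^4 .
\]
```

Truncating at a larger $K$ needs the higher moments $\mathbb{E}[\operatorname{tr}(S^k)]$, which is where the Isserlis/Wick expansion up to $k=10$ described in the abstract comes in.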

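Research Motivation and Goals

Motivate and analyze the weight-initialization dynamics of LDLT-based ℒ-Lipschitz networks.
Derive the exact marginal output variance of a Gaussian-initialized LDLT layer.
Show how the initialization parameters α and σ^2 affect variance preservation and gradient behavior.
Provide practical initialization guidance for mitigating information loss in deep ℒ-Lipschitz networks.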

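Proposed Method

Model the forward pass of an LDLT layer as y = γ W0 (α I + W0^T W0)^(-1/2) x and compute Var[y].
Express Cov[y|W0] and use the Woodbury identity to relate it to (α I + W0 W0^T)^(-1).
Express Var[y] in terms of E_W0[Tr((α I + S)^(-1))], where S = W0 W0^T and S ~ Wishart_m(n, σ^2 I).
Compute E[Tr(S^k)] up to k = 10 using a Laplace integral and a moment expansion, drawing on James's zonal polynomial results and a Wick/Isserlis expansion.
Provide truncated series approximations for small-to-moderate σ^2 and validate them with Monte Carlo simulation.
Discuss variance scaling and gradient trade-offs, including the implications of the initialization scale (e.g., 10/√n) and of α and γ.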
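
The forward model and the Woodbury-type step in the list above can be sanity-checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes the input x has identity covariance, reads Var[y] as the per-coordinate variance averaged over the m outputs, and the function names `ldlt_like_forward` and `conditional_variance` are hypothetical.

```python
import numpy as np

def ldlt_like_forward(W0, X, alpha=1.0, gamma=1.0):
    """Hypothetical helper: y = gamma * W0 (alpha I_n + W0^T W0)^(-1/2) x.

    W0 has shape (m, n); X holds one input per column, shape (n, N).
    """
    n = W0.shape[1]
    G = alpha * np.eye(n) + W0.T @ W0            # symmetric positive definite for alpha > 0
    evals, evecs = np.linalg.eigh(G)
    G_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T   # V diag(1/sqrt(lambda)) V^T
    return gamma * W0 @ (G_inv_sqrt @ X)

def conditional_variance(W0, alpha=1.0, gamma=1.0):
    """gamma^2/m * (m - alpha * tr((alpha I_m + S)^(-1))) with S = W0 W0^T.

    Assumes Cov[x] = I_n (an assumption of this sketch).
    """
    m = W0.shape[0]
    S = W0 @ W0.T
    trace_inv = np.trace(np.linalg.inv(alpha * np.eye(m) + S))
    return gamma**2 / m * (m - alpha * trace_inv)

rng = np.random.default_rng(0)
m, n, alpha, gamma = 6, 8, 1.0, 1.0
W0 = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))

# Push-through (Woodbury-type) identity:
# W0 (alpha I_n + W0^T W0)^(-1) W0^T = I_m - alpha (alpha I_m + W0 W0^T)^(-1)
lhs = W0 @ np.linalg.inv(alpha * np.eye(n) + W0.T @ W0) @ W0.T
rhs = np.eye(m) - alpha * np.linalg.inv(alpha * np.eye(m) + W0 @ W0.T)
identity_holds = np.allclose(lhs, rhs)

# Empirical variance over many standard-normal inputs vs. the trace formula
X = rng.standard_normal((n, 100_000))
Y = ldlt_like_forward(W0, X, alpha, gamma)
empirical = Y.var(axis=1).mean()
closed_form = conditional_variance(W0, alpha, gamma)

print(identity_holds, empirical, closed_form)
```

For a fixed W0 the empirical and closed-form values should agree up to sampling noise; averaging `conditional_variance` over many draws of W0 then gives the marginal variance the review discusses.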

Figure 1: Variance difference estimation for weight parameterization sizes from 2 to 9

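Experimental Results

Research Questions

RQ1: What is the exact marginal variance of an LDLT-based ℒ-Lipschitz layer under Gaussian initialization?
RQ2: How do the initialization hyperparameters α, γ, and σ^2 affect variance preservation and gradient dynamics in LDLT networks?
RQ3: Can the LDLT parameterization achieve unit output variance across depth, and what are its limits?
RQ4: How do the truncated Wishart moment expansions compare with Monte Carlo estimates in practice?
RQ5: Do empirical results on common datasets and optimizers agree with the variance-preservation theory?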

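Key Findings

The exact marginal variance of an LDLT layer can be written as Var[y] = (γ^2/m) (m − α E[Tr((α I_m + S)^(-1))]), with S ~ Wishart_m(n, σ^2 I).
Using the exact Laplace representation and higher-order Wishart moments, E[Tr(S^k)] is evaluated up to k = 10, giving accurate variance estimates for small-to-moderate σ^2.
The variance grows with σ^2, so approaching unit variance requires a larger σ^2; however, too large a σ^2 can cause vanishing gradients through saturation of the Lipschitz manifold.
Empirically, He/Kaiming-style initialization (scaling 1/√n) gives an output variance of about 0.41, whereas 10/√n scaling with α = 1 brings the variance to about 0.9, showing that variance can be approximately preserved.
On the Higgs dataset, variance preservation does not always translate into better empirical performance; under some conditions He initialization still performs better in practice.
The backpropagation analysis mirrors the forward result, suggesting that gradient variance follows a similar trend.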
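
The gap between the two initialization scales in the findings above can be explored with a small Monte Carlo experiment built on the variance formula. This is a sketch under assumptions the review does not state: a square layer (m = n = 64), γ = 1, α = 1, and identity input covariance; `mc_output_variance` is a hypothetical helper, not the authors' experiment.

```python
import numpy as np

def mc_output_variance(n, scale, alpha=1.0, gamma=1.0, trials=50, seed=0):
    """Monte Carlo estimate of gamma^2/n * (n - alpha * E[tr((alpha I + S)^(-1))]).

    W0 is n x n with IID N(0, (scale / sqrt(n))^2) entries and S = W0 W0^T.
    Square shape and identity input covariance are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    sigma = scale / np.sqrt(n)
    total = 0.0
    for _ in range(trials):
        W0 = rng.normal(0.0, sigma, size=(n, n))
        eigs = np.linalg.eigvalsh(W0 @ W0.T)          # eigenvalues of S
        trace_inv = np.sum(1.0 / (alpha + eigs))      # tr((alpha I + S)^(-1))
        total += gamma**2 / n * (n - alpha * trace_inv)
    return total / trials

var_unit_scale = mc_output_variance(n=64, scale=1.0)    # sigma = 1/sqrt(n)
var_large_scale = mc_output_variance(n=64, scale=10.0)  # sigma = 10/sqrt(n)

print(f"scale 1/sqrt(n):  {var_unit_scale:.3f}")
print(f"scale 10/sqrt(n): {var_large_scale:.3f}")
```

Under these assumptions the first estimate should come out well below 1 and the second much closer to 1, in line with the qualitative gap reported in the abstract; exact values depend on the layer shape and setup the authors used, which this review does not specify.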

Figure 2: Variance difference estimation for weight parameterization sizes from 10 to 90

This review was generated by AI and reviewed by a human editor.