QUICK REVIEW

[Paper Review] Where is the Information in a Deep Neural Network?

Alessandro Achille, Giovanni Paolini | arXiv (Cornell University) | May 29, 2019

Stochastic Gradient Optimization Techniques | References: 49 | Citations: 47

One-Sentence Summary

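This paper defines and analyzes the Information in the Weights (IW) of a DNN as a trade-off between the change in loss caused by perturbing the weights and the coding length relative to the training data, links IW to generalization through PAC-Bayes bounds, and relates the information in the weights to the invariance of the activations using Fisher information.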

ABSTRACT

Whatever information a deep neural network has gleaned from training data is encoded in its weights. How this information affects the response of the network to future data remains largely an open question. Indeed, even defining and measuring information entails some subtleties, since a trained network is a deterministic map, so standard information measures can be degenerate. We measure information in a neural network via the optimal trade-off between accuracy of the response and complexity of the weights, measured by their coding length. Depending on the choice of code, the definition can reduce to standard measures such as Shannon Mutual Information and Fisher Information. However, the more general definition allows us to relate information to generalization and invariance, through a novel notion of effective information in the activations of a deep network. We establish a novel relation between the information in the weights and the effective information in the activations, and use this result to show that models with low (information) complexity not only generalize better, but are bound to learn invariant representations of future inputs. These relations hinge not only on the architecture of the model, but also on how it is trained, highlighting the complex inter-dependency between the class of functions implemented by deep neural networks, the loss function used for training them from finite data, and the inductive bias implicit in the optimization.

Motivation and Objectives

Define Information in the Weights as the trade-off between perturbation-induced loss changes and coding length relative to training data.
Relate the information in weights to generalization through PAC-Bayes bounds.
Introduce and formalize the notion of effective information in activations and connect it to weight information.
Derive relations between Fisher Information and Shannon information, and show how training dynamics influence these quantities.
Highlight the dependency of information measures on architecture, loss, and optimization, and discuss practical encoding choices.

Proposed Method

Define the Information in the Weights (IW) using a pre-distribution P and a post-distribution Q over the weights: Q is chosen to minimize the expected training loss L_D under Q plus beta times KL(Q||P), and IW is the resulting KL(Q||P).
Show that at beta=1 the IW formalism reduces to the ELBO used in Bayesian neural networks, without requiring a Bayesian posterior.
Relate IW to generalization using the PAC-Bayes bound, yielding bounds on test loss in terms of training loss and KL(Q||P).
Specialize IW to Shannon information by choosing P and Q to minimize the bound in expectation, yielding I(w;D).
Specialize IW to Fisher information by assuming Gaussian pre/post distributions and linking the KL term to log-determinant of the Fisher (and Hessian) under small-beta approximations.
Demonstrate that Fisher information controls invariance properties while Shannon information controls generalization, and discuss their first-order connection under stochastic optimization.
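
The Gaussian case in the list above can be illustrated with a minimal sketch. It assumes a toy quadratic training loss around a minimum `w_star` with Hessian `H`, a post-distribution Q = N(w_star, Sigma), and a pre-distribution P = N(0, lam2 * I). All function and parameter names here are hypothetical, and the constants follow this sketch's own convention (loss = 0.5 * dw^T H dw), so they may differ from the paper's by factors of 2.

```python
import numpy as np

def gaussian_kl(mu_q, cov_q, mu_p, cov_p):
    """KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) in nats."""
    k = mu_q.shape[0]
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    _, logdet_q = np.linalg.slogdet(cov_q)
    _, logdet_p = np.linalg.slogdet(cov_p)
    return 0.5 * (np.trace(cov_p_inv @ cov_q) + diff @ cov_p_inv @ diff
                  - k + logdet_p - logdet_q)

def objective(H, w_star, cov_q, beta, lam2, loss_at_min=0.0):
    """Expected quadratic loss under Q plus beta * KL(Q||P); returns (objective, KL)."""
    k = w_star.shape[0]
    # For loss(w) = loss_at_min + 0.5 * (w - w_star)^T H (w - w_star) and
    # Q = N(w_star, cov_q), the expected loss is loss_at_min + 0.5 * tr(H cov_q).
    expected_loss = loss_at_min + 0.5 * np.trace(H @ cov_q)
    kl = gaussian_kl(w_star, cov_q, np.zeros(k), lam2 * np.eye(k))
    return expected_loss + beta * kl, kl

def optimal_cov(H, beta, lam2):
    """Covariance minimizing the objective: beta * (H + (beta / lam2) I)^-1."""
    k = H.shape[0]
    return beta * np.linalg.inv(H + (beta / lam2) * np.eye(k))

def kl_closed_form(H, w_star, beta, lam2):
    """KL at the optimum, written with the log-determinant of the curvature."""
    k = H.shape[0]
    cov = optimal_cov(H, beta, lam2)
    _, logdet = np.linalg.slogdet(H + (beta / lam2) * np.eye(k))
    return 0.5 * (logdet + k * np.log(lam2 / beta)
                  + (w_star @ w_star + np.trace(cov)) / lam2 - k)

# Toy example (assumed values): three weights with different curvature.
H = np.diag([4.0, 1.0, 0.25])
w_star = np.array([1.0, -0.5, 0.2])
for beta in (0.01, 0.1, 1.0):
    _, kl = objective(H, w_star, optimal_cov(H, beta, 1.0), beta, 1.0)
    print(f"beta={beta}: KL(Q||P) = {kl:.3f} nats")
```

In this toy setting, a larger beta lets the optimal Q spread further along flat directions of the loss, so the KL term (the information kept in the weights) shrinks, and the log-determinant term shows how curvature enters the count.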

Experimental Results

Research Questions

RQ1: How can information about training data retained in network weights be quantified in a computable way for large DNNs?
RQ2: What is the relationship between the information in the weights and the information in activations with respect to generalization and invariance?
RQ3: How do different information measures (Shannon vs Fisher) relate within the weight-activation framework under stochastic optimization?
RQ4: How do architectural choices, loss functions, and optimization dynamics jointly influence the information content, generalization, and invariance of learned representations?
RQ5: Can bounds on test loss be derived from the Information in the Weights via PAC-Bayes and relate to activation invariance?

Main Findings

The Information in the Weights (IW) is defined as KL(Q||P), the KL divergence between the post-distribution Q over weights and the pre-distribution P, evaluated at the Q that minimizes the expected training loss plus beta times this KL term.
IW bounds generalization through a PAC-Bayes bound on test loss, linking training behavior to performance on unseen data.
Under Gaussian encoding choices, IW reduces to a function of the log-determinant of the Fisher information, connecting it to the curvature and stability of the learned solution.
Shannon information about the dataset can be recovered as the expected IW under an adapted prior, connecting IW to I(w;D).
Fisher information controls invariance to nuisances, while Shannon information controls generalization; SGD dynamics tend to couple these measures through flat minima and stability, tying optimization geometry to information content.
The framework shows a tight interdependence between network architecture, training loss, and optimization in determining what information is retained and how representations generalize.
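
To make the PAC-Bayes finding above concrete, here is a rough sketch of how such a bound turns a training loss and a KL term into a test-loss bound. The review does not state the exact inequality, so the sketch assumes the form given in McAllester's PAC-Bayesian tutorial for a per-sample loss bounded in [0, loss_max] and beta > 1/2; the function name, argument names, and example numbers are hypothetical, and the constants in the paper may differ.

```python
import math

def pac_bayes_bound(train_loss, kl_nats, n_samples, beta=1.0, loss_max=1.0, delta=0.05):
    """Upper bound on expected test loss under Q (assumed McAllester-style form).

    train_loss: average training loss under Q, in [0, loss_max]
    kl_nats:    KL(Q||P) in nats (the information in the weights)
    n_samples:  number of training samples
    delta:      the bound holds with probability at least 1 - delta
    """
    if beta <= 0.5:
        raise ValueError("this form of the bound requires beta > 1/2")
    # Complexity term: grows with KL(Q||P), shrinks with more data.
    complexity = beta * loss_max / n_samples * (kl_nats + math.log(1.0 / delta))
    return (train_loss + complexity) / (1.0 - 1.0 / (2.0 * beta))

# Hypothetical numbers: 5% training loss, 5000 nats in the weights, 50k samples.
print(pac_bayes_bound(train_loss=0.05, kl_nats=5000.0, n_samples=50_000))
```

Under this assumed form, the bound tightens either by lowering KL(Q||P) at a fixed training loss or by adding data, which is the sense in which lower information in the weights goes with better generalization.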

This review was generated by AI and checked by a human editor.