QUICK REVIEW

[Paper Review] Understanding training and generalization in deep learning by Fourier analysis

Zhi-Qin John Xu | arXiv (Cornell University) | Aug 13, 2018

Stochastic Gradient Optimization Techniques | References: 17 | Citations: 42

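One-line summary

This paper proposes a Fourier-analysis framework for DNN training. It shows that gradient-based methods prioritize low-frequency components, and that small initialization promotes good generalization while preserving the ability to fit any function.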

ABSTRACT

Background: It is still an open research area to theoretically understand why Deep Neural Networks (DNNs)---equipped with many more parameters than training data and trained by (stochastic) gradient-based methods---often achieve remarkably low generalization error. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains: i) DNN with (stochastic) gradient-based methods often endows low-frequency components of the target function with a higher priority during the training; ii) Small initialization leads to good generalization ability of DNN while preserving the DNN's ability to fit any function. These results are further confirmed by experiments of DNNs fitting the following datasets, that is, natural images, one-dimensional functions and MNIST dataset.

Motivation and objectives

Explain why DNNs trained with gradient-based methods generalize well despite large parameter counts.
Show how gradient dynamics favor low-frequency components of the target function.
Demonstrate how initialization scale affects the trade-off between fitting high-frequency components and generalization.
Extend the framework qualitatively to general DNNs and validate with experiments on natural images, 1-D functions, and MNIST.

Proposed method

Develop a theoretical framework in the Fourier domain, using a DNN with one hidden layer and tanh activations as the illustrative case.
Derive the frequency-domain form of the DNN output and the loss, and obtain gradients with respect to parameters.
Show that the gradient magnitude for each frequency component factors into a term that decays with frequency and the error amplitude at that frequency.
Prove theorems showing that lower frequencies receive training priority, and give conditions under which low-frequency convergence is preserved.
Argue the qualitative extension of the framework to general DNNs and discuss the role of activation spectra.
Empirically validate the theory with experiments on natural images, 1-D functions, and MNIST, with comparisons of small vs large initialization.
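
The decay term mentioned above can be made concrete for a single tanh neuron. The following is a sketch, not the paper's own derivation: it assumes the Fourier convention \hat{f}(k) = \int f(x) e^{-ikx} dx (in the distributional sense), and the symbols w (weight), b (bias), and k (frequency) are chosen here for illustration. The paper's notation and constants may differ.

```latex
\begin{align}
\hat{\sigma}(k) &= \int_{-\infty}^{\infty} \tanh(x)\, e^{-ikx}\, dx
  = -\frac{i\pi}{\sinh(\pi k / 2)}, \qquad k \neq 0, \\
\mathcal{F}[\tanh(wx + b)](k) &= \frac{1}{|w|}\, e^{ibk/w}\, \hat{\sigma}\!\left(\frac{k}{w}\right)
  = -\frac{i\pi}{|w|}\, \frac{e^{ibk/w}}{\sinh\!\big(\pi k / (2w)\big)}, \\
\big|\mathcal{F}[\tanh(wx + b)](k)\big| &\approx \frac{2\pi}{|w|}\, e^{-\pi |k| / (2|w|)}
  \qquad \text{for } |k| \gg |w|.
\end{align}
```

Under these assumptions, a smaller |w| makes the envelope fall off faster in k, so a neuron with small weights contributes almost nothing at high frequencies. This is consistent with the link this review draws between the decay term, the activation function, and the weight scales.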

Experimental results

Research questions

RQ1: Do gradient-based training dynamics preferentially reduce errors in low-frequency components of the target function?
RQ2: How does initialization scale influence the frequency content of the learned function and generalization performance?
RQ3: Can the Fourier-analysis framework qualitatively extend to general DNN architectures beyond the illustrative one-hidden-layer model?
RQ4: What empirical evidence supports the frequency-priority behavior on real-world datasets like natural images and MNIST?
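
One way to probe a question like RQ1 is to compare the discrete Fourier transforms of the target and of the model output, frequency by frequency. The sketch below is an assumption about how such a measurement could be set up, not the paper's exact protocol: the function name `relative_error_by_frequency`, the two-frequency target, and the stand-in `partial_fit` array (which mimics a model that has so far captured only the low-frequency part) are all hypothetical.

```python
import numpy as np


def relative_error_by_frequency(target, output, freq_indices):
    """Relative DFT error |F[output] - F[target]| / |F[target]| at chosen indices."""
    target_hat = np.fft.rfft(target)
    output_hat = np.fft.rfft(output)
    idx = np.asarray(freq_indices)
    return np.abs(output_hat[idx] - target_hat[idx]) / np.abs(target_hat[idx])


# Hypothetical 1-D target with one low-frequency and one high-frequency component,
# sampled on a uniform periodic grid so each sine falls on a single DFT bin.
n_samples = 256
x = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
target = np.sin(x) + 0.5 * np.sin(5.0 * x)

# Stand-in for a model output that has fitted only the low-frequency part
# (an assumption for illustration; no network is trained here).
partial_fit = np.sin(x)

# DFT bin 1 corresponds to sin(x), bin 5 to sin(5x).
errors = relative_error_by_frequency(target, partial_fit, [1, 5])
print("relative error at bins 1 and 5:", errors)
```

In an actual experiment, `partial_fit` would be replaced by the network output recorded at successive training steps, and the relative errors at each bin would be tracked over time to see which frequencies converge first.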

Key findings

Low-frequency components of the target function are given higher training priority under gradient-based optimization.
Small initialization leads to smaller high-frequency amplitudes and better generalization, while still allowing the network to fit any function.
The decay term in the frequency-domain gradient is tied to the activation function and weight scales, guiding frequency learning order.
For large networks, the spectral norm changes little during training, yet the framework can still qualitatively explain observed training dynamics.
Experiments on natural images and MNIST illustrate the frequency-prioritization and the impact of initialization on generalization.
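
The dependence of the decay term on weight scale can be illustrated numerically. The sketch below assumes the closed-form magnitude pi / (|w| sinh(pi |k| / (2 |w|))) for the Fourier transform of tanh(w x + b), which holds under the convention \hat{f}(k) = \int f(x) e^{-ikx} dx for k != 0. The function name `tanh_spectrum_magnitude` and the chosen frequencies and weights are hypothetical; this is not code from the paper.

```python
import numpy as np


def tanh_spectrum_magnitude(k, w):
    """Magnitude of the Fourier transform of tanh(w*x + b) at frequency k != 0.

    Assumes the convention f_hat(k) = integral f(x) exp(-i k x) dx; the bias b
    only changes the phase, so it does not appear in the magnitude.
    """
    k = np.abs(np.asarray(k, dtype=float))
    w = abs(float(w))
    return np.pi / (w * np.sinh(np.pi * k / (2.0 * w)))


# Compare a low and a high frequency for a small and a large weight scale.
freqs = np.array([1.0, 10.0])
ratios = {}
for w in (1.0, 10.0):
    low, high = tanh_spectrum_magnitude(freqs, w)
    ratios[w] = high / low
    print(f"w = {w}: |F(k=10)| / |F(k=1)| = {ratios[w]:.3e}")
```

With the small weight, the high-frequency amplitude is suppressed by many orders of magnitude relative to the low-frequency one; with the large weight, the suppression is far milder. Under the stated assumptions, this matches the finding that small initialization yields smaller high-frequency amplitudes.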

This review was generated by AI and checked by a human editor.