QUICK REVIEW

[Paper Review] Learning Sparse Neural Networks through $L_0$ Regularization

Christos Louizos, Max Welling | arXiv (Cornell University) | Dec 4, 2017

Gaussian Processes and Bayesian Inference | References: 29 | Citations: 149

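One-sentence summary

This paper provides a practical framework for $L_0$ norm regularization in neural networks: it learns stochastic gates that prune weights during training, which yields exact zeros and conditional computation while keeping the optimization differentiable.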

ABSTRACT

We propose a practical method for $L_0$ norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization. AIC and BIC, well-known model selection criteria, are special cases of $L_0$ regularization. However, since the $L_0$ norm of weights is non-differentiable, we cannot incorporate it directly as a regularization term in the objective function. We propose a solution through the inclusion of a collection of non-negative stochastic gates, which collectively determine which weights to set to zero. We show that, somewhat surprisingly, for certain distributions over the gates, the expected $L_0$ norm of the resulting gated weights is differentiable with respect to the distribution parameters. We further propose the hard concrete distribution for the gates, which is obtained by "stretching" a binary concrete distribution and then transforming its samples with a hard-sigmoid. The parameters of the distribution over the gates can then be jointly optimized with the original network parameters. As a result our method allows for straightforward and efficient learning of model structures with stochastic gradient descent and allows for conditional computation in a principled way. We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer.

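Motivation and objectives

Motivate sparsity and model compression as a way to reduce computation and improve generalization in deep networks.
Develop a differentiable surrogate for the non-differentiable $L_0$ norm that still keeps parameters at exactly zero.
Enable joint optimization of the network parameters and the gate distribution parameters with gradient-based methods.
Induce sparsity during training to allow conditional computation and speedups.
Show a competitive trade-off between sparsity and accuracy on standard benchmarks.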

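Proposed method

Reparameterize each weight as $\theta_j = \tilde{\theta}_j z_j$ with $z_j \in \{0, 1\}$, which turns the $L_0$ norm into a count of active gates.
Treat the discrete gates as stochastic, $z_j \sim \mathrm{Bernoulli}(\pi_j)$, and then relax them through a continuous surrogate so that gradient-based optimization is efficient.
Introduce an auxiliary continuous variable $s$ and a hard-sigmoid gate $z = \min(1, \max(0, s))$, giving a smooth objective that admits reparameterization while still allowing exact zeros.
Model the gates with the hard concrete distribution, obtained by stretching a binary concrete distribution and applying a hard-sigmoid to its samples, so that the gate parameters $\phi$ can be learned by differentiation.
Express the $L_0$ penalty as the expected number of active gates and optimize the objective with reparameterized Monte Carlo estimates.
Optionally combine $L_0$ with $L_2$ regularization, and extend to group sparsity by sharing one gate across a group of parameters.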

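Experimental results

Research questions

RQ1: Can $L_0$ regularization be optimized efficiently in neural networks while keeping weights at exactly zero?
RQ2: Are hard concrete gates a suitable gradient-friendly surrogate that enables effective pruning during training?
RQ3: Does jointly learning gate and weight parameters yield sparse models with competitive accuracy and potential computational speedups?
RQ4: How does the proposed method compare with existing sparsification methods and dropout-based regularization on standard benchmarks?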

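Key findings

On the MNIST and CIFAR benchmarks, the method produces sparsified architectures while maintaining test accuracy competitive with existing pruning methods.
Networks trained with $L_0$ regularization tend to be pruned more aggressively in the layers where pruning most affects cost, such as the input layer or the first fully connected layer of certain networks.
The approach allows a gradual reduction in floating-point operations during training, indicating potential training speedups akin to conditional computation.
On the CIFAR datasets, wide residual networks with $L_0$ regularization improve on the dropout baseline at certain regularization strengths, and their sparsity allows additional speedups.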
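
To make the link between gate probabilities and compute cost concrete, the sketch below estimates the expected multiply-accumulate count of one dense layer whose input and output units each carry a gate. It is an illustration under assumptions, not a procedure taken from the paper: the function name is hypothetical, the gates are assumed independent, and the 784-by-300 layer size is chosen only as an example.

```python
def expected_dense_macs(p_active_in, p_active_out):
    """Expected multiply-accumulates of a dense layer with gated units.

    p_active_in / p_active_out hold P(gate != 0) for each input / output unit.
    With independent gates, E[#active inputs * #active outputs] factorizes
    into the product of the two expected counts.
    """
    return sum(p_active_in) * sum(p_active_out)


if __name__ == "__main__":
    n_in, n_out = 784, 300  # illustrative layer size (assumption)
    dense = expected_dense_macs([1.0] * n_in, [1.0] * n_out)
    half_inputs = expected_dense_macs([0.5] * n_in, [1.0] * n_out)
    print("all gates open:", dense)
    print("inputs kept with probability 0.5:", half_inputs)
```

Since the cost of a layer scales with the number of surviving input units, closing gates on a wide input layer removes more operations per gate than closing gates in a narrow layer, which is consistent with the layer-wise pruning pattern described above.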

This review was generated by AI and checked by a human editor.