QUICK REVIEW

[Paper Review] Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

Matthew Mackay, Paul Vicol|arXiv (Cornell University)|Mar 7, 2019

Advanced Neural Network Applications|References: 58|Citations: 54

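One-line summary

STNs learn hyperparameters online by approximating the best response with compact, hypernetwork-based gating. This makes discrete and stochastic hyperparameters tunable and yields schedules that improve performance on PTB, CIFAR-10, and other benchmarks.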

ABSTRACT

Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).
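
The alternating scheme the abstract describes can be illustrated on a one-dimensional toy problem. The sketch below is not the authors' implementation. The quadratic losses, the constants a and b, the affine approximation w_phi(lam) = phi0 + phi1 * lam, the fixed perturbations +sigma and -sigma (a deterministic stand-in for sampling), and all step sizes are assumptions chosen for illustration.

```python
import math

def train_toy_stn(a=2.0, b=1.0, lam=1.0, sigma=0.5,
                  lr_w=0.05, lr_lam=0.05, outer_steps=2000, inner_steps=5):
    """Toy alternating optimization (hypothetical setup, not the paper's code).

    Lower level (training):   f(w, lam) = (w - a)^2 + exp(lam) * w^2
    Upper level (validation): F(w)      = (w - b)^2
    Best-response model:      w_phi(lam) = phi0 + phi1 * lam
    """
    phi0, phi1 = 0.0, 0.0
    for _ in range(outer_steps):
        # Training phase: fit w_phi near the current lam.
        # Gradients are taken with respect to phi only, never lam.
        for _ in range(inner_steps):
            g0 = g1 = 0.0
            for eps in (-sigma, sigma):
                lam_p = lam + eps
                w = phi0 + phi1 * lam_p
                df_dw = 2.0 * (w - a) + 2.0 * math.exp(lam_p) * w
                g0 += 0.5 * df_dw           # d f / d phi0
                g1 += 0.5 * df_dw * lam_p   # d f / d phi1
            phi0 -= lr_w * g0
            phi1 -= lr_w * g1
        # Hyperparameter phase: differentiate the validation loss
        # through the approximate best response (d w_phi / d lam = phi1).
        w = phi0 + phi1 * lam
        lam -= lr_lam * 2.0 * (w - b) * phi1
    return lam, phi0 + phi1 * lam

if __name__ == "__main__":
    lam, w = train_toy_stn()
    print(f"lam = {lam:.4f}, w_phi(lam) = {w:.4f}")
```

In this toy setup the exact best response is w*(lam) = a / (1 + exp(lam)), so with a = 2 and b = 1 the hyperparameter should drift toward lam = 0, where the best response equals the validation target. The training loss is never differentiated with respect to lam, which mirrors the property the abstract highlights.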

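Motivation and objectives

Frame hyperparameter optimization as a bilevel problem in which the optimal training weights depend on the hyperparameters.
Propose a scalable, memory-efficient best-response approximation for neural networks.
Develop Self-Tuning Networks, which update hyperparameters online without differentiating the training loss with respect to the hyperparameters.
Show that STNs produce hyperparameter schedules that improve performance on large-scale datasets.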
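
In symbols, the bilevel view in the first item can be sketched as follows. Here F is the upper-level (validation) objective, f the lower-level (training) objective, λ the hyperparameters, and w the weights. The argument order and typesetting are assumptions of this sketch.

```latex
\lambda^{*} = \operatorname*{arg\,min}_{\lambda} \; F\bigl(\lambda,\, w^{*}(\lambda)\bigr)
\quad \text{subject to} \quad
w^{*}(\lambda) = \operatorname*{arg\,min}_{w} \; f(\lambda,\, w)
```

The map from λ to w*(λ) is the best-response function that the method approximates.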

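Proposed method

Formulate a bilevel problem with an upper-level objective F and a lower-level objective f, and introduce the best response w*(λ).
Approximate the best response with a parametric function with parameters φ, and optimize λ through the approximate best response (Equation 3).
Propose a memory-efficient best-response module in which the per-layer weights and biases are $\hat{W}_\phi(\lambda) = W_{\mathrm{elem}} + (V\lambda) \odot_{\mathrm{row}} W_{\mathrm{hyper}}$ and $\hat{b}_\phi(\lambda) = b_{\mathrm{elem}} + (C\lambda) \odot b_{\mathrm{hyper}}$ (Equation 10).
Show that for a two-layer linear network with an L2-regularized Jacobian, the exact best response can be represented by a gating mechanism (Theorem 2).
Provide a linear (affine) gating variant for small hyperparameter ranges, which guarantees the correct Jacobian under a quadratic lower-level loss (Theorem 3).
Describe adaptive sampling of the hyperparameter neighborhood scale σ, using an entropy term to balance exploration against local fidelity (Equation 15).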
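
The per-layer module of Equation 10 can be sketched in NumPy as below. This is a minimal illustration rather than the authors' code. The function names, the dictionary layout, and the shapes (W_elem and W_hyper of shape (D_out, D_in), V and C of shape (D_out, n), bias vectors of length D_out) are assumptions inferred from the formula. The row-wise product is read as scaling each row of W_hyper by the matching entry of Vλ.

```python
import numpy as np

def init_params(d_in, d_out, n_hyper, rng):
    """Hypothetical initializer; shapes are inferred from Equation 10."""
    return {
        "W_elem": rng.normal(size=(d_out, d_in)),
        "W_hyper": rng.normal(size=(d_out, d_in)),
        "V": rng.normal(size=(d_out, n_hyper)),
        "b_elem": rng.normal(size=d_out),
        "b_hyper": rng.normal(size=d_out),
        "C": rng.normal(size=(d_out, n_hyper)),
    }

def best_response_weights(p, lam):
    """Build W_hat(lam) and b_hat(lam) explicitly."""
    w_gate = p["V"] @ lam                      # one gate per output unit
    b_gate = p["C"] @ lam
    W_hat = p["W_elem"] + w_gate[:, None] * p["W_hyper"]  # row-wise scaling
    b_hat = p["b_elem"] + b_gate * p["b_hyper"]
    return W_hat, b_hat

def forward_explicit(p, x, lam):
    """Affine layer using the materialized weights."""
    W_hat, b_hat = best_response_weights(p, lam)
    return x @ W_hat.T + b_hat

def forward_gated(p, x, lam):
    """Same output, written as an ordinary layer plus a gated hidden layer."""
    w_gate = p["V"] @ lam
    b_gate = p["C"] @ lam
    base = x @ p["W_elem"].T + p["b_elem"]
    gated = w_gate * (x @ p["W_hyper"].T) + b_gate * p["b_hyper"]
    return base + gated

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = init_params(d_in=4, d_out=3, n_hyper=2, rng=rng)
    x = rng.normal(size=(5, 4))
    lam = np.array([0.1, -0.3])
    print(np.allclose(forward_explicit(p, x, lam), forward_gated(p, x, lam)))
```

The second form shows why the module can be read as gating hidden units on the hyperparameters: the hyperparameter-dependent part is an extra linear layer whose outputs are scaled per unit, so the full weight matrix never needs to be rebuilt for each λ.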

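Experimental results

Research questions

RQ1: Can compact, differentiable best-response maps w*(λ) be learned to enable gradient-based hyperparameter optimization?
RQ2: Do hyperparameters tuned online produce schedules that outperform fixed hyperparameters across large neural architectures?
RQ3: Can the approach handle discrete and stochastic hyperparameters without differentiating the training loss with respect to the hyperparameters?
RQ4: Does the proposed STN architecture scale to deep networks, and is it practical on standard benchmarks (PTB, CIFAR-10)?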

Key findings

Method	PTB validation perplexity	PTB test perplexity	CIFAR-10 validation loss	CIFAR-10 test loss
Grid Search	97.32	94.58	0.794	0.809
Random Search	84.81	81.46	0.921	0.752
Bayesian Optimization	72.13	69.29	0.636	0.651
STN	70.30	67.68	0.575	0.576

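STNs discover hyperparameter schedules during training that outperform fixed hyperparameter values, for example in the PTB and CIFAR-10 experiments.
Compared with grid search, random search, and Bayesian optimization, STNs improve validation and test performance faster on the PTB and CIFAR-10 tasks.
The local best-response approximation with affine gating remains valid for quadratic lower-level losses and preserves correct gradient information (Theorem 3).
The affine best-response architecture $(\hat{W}_\phi(\lambda), \hat{b}_\phi(\lambda))$ is memory-efficient, requiring $O(D_{\mathrm{out}}(2D_{\mathrm{in}} + n))$ parameters for the weights and $O(D_{\mathrm{out}}(2 + n))$ for the biases (Equation 11).
STNs produce interpretable hyperparameter schedules. For example, they vary dropout rates during training, forming a curriculum that improves generalization.
On PTB, the STN-based LSTM reaches a validation perplexity of 70.30 and a test perplexity of 67.68, outperforming grid search, random search, and Bayesian optimization in Table 2.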
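
The parameter counts quoted for the affine best-response architecture can be checked with a few lines of arithmetic. The helper names and the example layer sizes below are hypothetical. The counts assume two D_out x D_in weight matrices plus a D_out x n gating matrix for the weights, and two length-D_out bias vectors plus a D_out x n gating matrix for the biases.

```python
def stn_layer_param_count(d_in, d_out, n_hyper):
    """Parameters of one best-response layer: (weights, biases)."""
    weights = d_out * (2 * d_in + n_hyper)   # W_elem, W_hyper, V
    biases = d_out * (2 + n_hyper)           # b_elem, b_hyper, C
    return weights, biases

def plain_layer_param_count(d_in, d_out):
    """Parameters of an ordinary affine layer, for comparison."""
    return d_out * d_in + d_out

if __name__ == "__main__":
    # Illustrative sizes only, not taken from the paper.
    w, b = stn_layer_param_count(d_in=100, d_out=50, n_hyper=7)
    base = plain_layer_param_count(d_in=100, d_out=50)
    print(w, b, base, (w + b) / base)
```

For these sizes the best-response layer holds 10,800 parameters against 5,050 for the plain layer, a little over twice as many, and the overhead beyond the factor of two grows only linearly in the number of hyperparameters n.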

This review was generated by AI and checked by a human editor.