QUICK REVIEW

[Paper Review] Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov, D. A. Podoprikhin|arXiv (Cornell University)|Mar 14, 2018

Advanced Neural Network Applications | References: 23 | Citations: 227

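One-Line Summary

Stochastic Weight Averaging (SWA) averages the weights visited along an SGD trajectory run with a cyclical or constant learning rate. This yields better generalization and flatter optima, and with a single model it often matches or exceeds FGE ensembles.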

ABSTRACT

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.

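Motivation and Objectives

Ground the work in studies of loss-surface geometry in deep networks and in the potential generalization benefit of averaging in weight space.
Introduce Stochastic Weight Averaging (SWA) as an easy-to-implement modification of SGD.
Analyze how SWA affects the width and flatness of the solutions found.
Evaluate SWA empirically on CIFAR, ImageNet, and multiple architectures, comparing it with SGD and FGE ensembles.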

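Proposed Method

Define SWA as the equally weighted average of multiple SGD iterates collected during training with a cyclical or constant learning rate.
Use a cyclical or high constant learning-rate schedule to explore the region of high-performing weights, and compute w_SWA as a running average of the captured weights.
If the network uses batch normalization, run one additional pass over the training data with the SWA weights to recompute the batch normalization statistics.
Compare SWA with standard SGD and Fast Geometric Ensembling (FGE) in terms of test accuracy and training loss.
Show that SWA finds wider, flatter optima than SGD and approximates FGE with a single model.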
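
The averaging step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' released code: the function names (cyclical_lr, update_swa, swa_train), the flat list-of-floats weight representation, and the choice to capture weights at the end of each cycle are assumptions made for the example. The learning rate is assumed to decay linearly from lr_max to lr_min within each cycle, and the pretraining phase that normally precedes SWA is omitted. The running average follows w_SWA <- (w_SWA * n_models + w) / (n_models + 1).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def cyclical_lr(step: int, cycle_len: int, lr_max: float, lr_min: float) -> float:
    """Linear decay from lr_max toward lr_min inside each cycle (assumed form).

    `step` is 0-indexed; the last step of every cycle uses exactly lr_min.
    """
    t = (step % cycle_len + 1) / cycle_len  # fraction of the cycle completed
    return (1.0 - t) * lr_max + t * lr_min


def update_swa(w_swa: Sequence[float], w: Sequence[float], n_models: int) -> List[float]:
    """Equal-weight running average over the n_models + 1 weights seen so far."""
    return [(a * n_models + b) / (n_models + 1) for a, b in zip(w_swa, w)]


def swa_train(
    grad_fn: Callable[[List[float], int], List[float]],  # hypothetical gradient oracle
    w0: Sequence[float],
    n_steps: int,
    cycle_len: int,
    lr_max: float,
    lr_min: float,
) -> Tuple[List[float], Optional[List[float]]]:
    """Run plain SGD and average the weights captured at the end of each cycle."""
    w = list(w0)
    w_swa: Optional[List[float]] = None
    n_models = 0
    for step in range(n_steps):
        lr = cyclical_lr(step, cycle_len, lr_max, lr_min)
        g = grad_fn(w, step)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        if (step + 1) % cycle_len == 0:  # learning rate is at its minimum here
            w_swa = list(w) if w_swa is None else update_swa(w_swa, w, n_models)
            n_models += 1
    return w, w_swa
```

Setting cycle_len to 1 and lr_max equal to lr_min gives the constant-learning-rate variant, where every iterate enters the average. For a network with batch normalization, the activation statistics would still need to be recomputed with w_swa before evaluation; that step is not shown here.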

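Experimental Results

Research Questions

RQ1: Does averaging SGD iterates along a cyclical or constant learning-rate trajectory generalize better than standard SGD?
RQ2: Are SWA solutions flatter and wider than those found by SGD, and how does this relate to generalization?
RQ3: Can SWA match or exceed the performance of FGE ensembles while using a single model?
RQ4: How does SWA perform across diverse architectures and datasets (CIFAR-10/100, ImageNet)?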

Key Findings

SWA with cyclical or constant learning rates improves test accuracy over conventional SGD across architectures and datasets.
SWA yields solutions that are wider (flatter) than SGD optima, and averaging moves to a more central region within high-performing weight sets.
SWA can approximate Fast Geometric Ensembling (FGE) with a single model, offering similar predictive diversity without training multiple models.
On ImageNet, SWA improves test accuracy by about 0.6–0.9 percentage points over pretrained models across ResNet-50, ResNet-152, and DenseNet-161.
On CIFAR-100, SWA achieves improvements over SGD of roughly 0.75–1.5 percentage points, while also showing gains on CIFAR-10 and with various architectures.
SWA provides nearly negligible computational overhead and is easy to implement, with publicly available code.
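
The claim that averaging moves toward a more central region can be illustrated on a toy problem. The sketch below assumes a two-dimensional quadratic bowl as the loss and five hand-picked snapshots on its unit-radius contour; both are illustrative assumptions, not data from the paper, and real network losses are not convex. It shows only the geometric intuition: points scattered around the edge of a low-loss region average to a point nearer its center.

```python
import math
from typing import List, Sequence


def loss(w: Sequence[float]) -> float:
    """Toy convex loss: a quadratic bowl with its minimum at the origin."""
    return 0.5 * sum(x * x for x in w)


def average(points: Sequence[Sequence[float]]) -> List[float]:
    """Equal-weight average of weight vectors, coordinate by coordinate."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


# Hypothetical SGD snapshots: all lie on the same loss contour (radius 1),
# at arbitrary angles around the minimum.
angles = [0.3, 1.9, 3.1, 4.4, 5.6]
snapshots = [[math.cos(a), math.sin(a)] for a in angles]

w_avg = average(snapshots)
snapshot_losses = [loss(s) for s in snapshots]

print("snapshot losses:", [round(v, 3) for v in snapshot_losses])
print("loss at average:", round(loss(w_avg), 3))
```

Every snapshot sits on the boundary with the same loss, while their average lands inside the contour with a lower loss. On a convex loss this is guaranteed by Jensen's inequality; for deep networks it is an empirical observation the paper reports, not a guarantee.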

This review was generated by AI and checked by a human editor.