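[Paper Digest] Averaging Weights Leads to Wider Optima and Better Generalization
Stochastic Weight Averaging (SWA) averages weights along the SGD trajectory under a cyclical or constant learning rate. It typically yields better generalization and flatter optima, and with a single model it often matches or exceeds the performance of FGE ensembles.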
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
Motivation and Objectives
- Motivate the study of loss surface geometry in deep networks and the potential generalization benefits of weight-space averaging.
- Introduce Stochastic Weight Averaging (SWA) as an easy-to-implement modification of SGD.
- Analyze how SWA affects solution width and flatness of optima.
- Empirically evaluate SWA across CIFAR, ImageNet, and multiple architectures, comparing to SGD and FGE ensembles.
Proposed Method
- Define SWA as the equally weighted average of multiple weight iterates collected along the SGD trajectory during training with a cyclical or constant learning rate.
- Use a cyclical or relatively high constant learning rate schedule to explore high-performing regions of weight space, then compute w_SWA as the running average of the captured weights.
- For networks with batch normalization, run one additional pass over the training data with the w_SWA weights to recompute the activation statistics.
- Compare SWA against standard SGD and Fast Geometric Ensembling (FGE) in terms of test accuracy and training loss.
- Demonstrate that SWA finds wider, flatter optima than SGD and approximates FGE with a single model.
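The averaging procedure described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' released code: the function names `cyclical_lr` and `swa`, the toy quadratic loss, and all hyperparameter values are hypothetical. The schedule decays linearly from `lr_max` to `lr_min` within each cycle, weights are captured at the end of each cycle (where the learning rate is lowest), and the starting weights are counted as the first member of the average.

```python
import random


def cyclical_lr(i, cycle_len, lr_max, lr_min):
    """Learning rate at 1-indexed iteration i.

    Decays linearly from lr_max towards lr_min inside each cycle and
    reaches lr_min exactly on the last iteration of the cycle.
    """
    t = ((i - 1) % cycle_len + 1) / cycle_len
    return (1 - t) * lr_max + t * lr_min


def swa(w_init, grad_fn, n_iters, cycle_len, lr_max, lr_min):
    """Run SGD from w_init and return (last SGD iterate, SWA average).

    grad_fn(w) is a hypothetical callback returning a (stochastic)
    gradient as a list of floats. Setting lr_max == lr_min and
    cycle_len == 1 gives the constant-learning-rate variant.
    """
    w = list(w_init)
    w_swa = list(w_init)  # assumption: the starting point joins the average
    n_models = 1          # number of weight vectors averaged so far
    for i in range(1, n_iters + 1):
        lr = cyclical_lr(i, cycle_len, lr_max, lr_min)
        g = grad_fn(w)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
        if i % cycle_len == 0:  # end of cycle: capture the current weights
            w_swa = [(a * n_models + b) / (n_models + 1)
                     for a, b in zip(w_swa, w)]
            n_models += 1
    return w, w_swa


# Toy usage: noisy gradients of L(w) = 0.5 * ||w||^2 (illustrative only).
rng = random.Random(0)
noisy_grad = lambda w: [wj + rng.gauss(0.0, 1.0) for wj in w]
w_sgd, w_avg = swa([1.0, -1.0], noisy_grad, n_iters=400,
                   cycle_len=20, lr_max=0.1, lr_min=0.02)
```

Only one extra copy of the weights (`w_swa`) is kept and it is updated once per cycle, which is why the overhead on top of ordinary SGD is small.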
Experimental Results
Research Questions
- RQ1: Does averaging SGD iterates collected under a cyclical or constant learning rate yield better generalization than standard SGD?
- RQ2: Are SWA solutions flatter and wider than those found by SGD, and how does this relate to generalization?
- RQ3: Can SWA match or exceed the performance of FGE ensembles while using a single model?
- RQ4: How does SWA perform across diverse architectures and datasets (CIFAR-10/100, ImageNet)?
Key Findings
- SWA with cyclical or constant learning rates improves test accuracy over conventional SGD across architectures and datasets.
- SWA yields solutions that are wider (flatter) than SGD optima, and averaging moves to a more central region within high-performing weight sets.
- SWA can approximate Fast Geometric Ensembling (FGE) with a single model, gaining much of the ensemble's benefit while keeping the test-time cost of a single network.
- On ImageNet, SWA improves test accuracy by about 0.6–0.9 percentage points over pretrained models across ResNet-50, ResNet-152, and DenseNet-161.
- On CIFAR-100, SWA achieves improvements over SGD of roughly 0.75–1.5 percentage points, while also showing gains on CIFAR-10 and with various architectures.
- SWA provides nearly negligible computational overhead and is easy to implement, with publicly available code.
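One way to probe the width claim in the findings above is to evaluate the loss along random rays leaving a solution: a wider optimum shows a slower increase in loss as the distance from the solution grows. The sketch below is a toy illustration of that idea, not the paper's evaluation code. The helpers `random_unit_direction` and `loss_along_ray` and the two quadratic losses are hypothetical stand-ins; in practice `loss_fn` would evaluate a network's train loss or test error at the perturbed weights.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Sample a direction uniformly on the unit sphere in `dim` dimensions."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]


def loss_along_ray(loss_fn, w, direction, distances):
    """Evaluate loss_fn at w + t * direction for every t in distances."""
    return [loss_fn([wj + t * dj for wj, dj in zip(w, direction)])
            for t in distances]


# Toy stand-ins for a sharp and a wide optimum, both located at the origin.
sharp_loss = lambda w: 5.0 * sum(x * x for x in w)
wide_loss = lambda w: 0.5 * sum(x * x for x in w)

rng = random.Random(0)
direction = random_unit_direction(3, rng)
distances = [0.0, 0.5, 1.0, 2.0]
sharp_curve = loss_along_ray(sharp_loss, [0.0, 0.0, 0.0], direction, distances)
wide_curve = loss_along_ray(wide_loss, [0.0, 0.0, 0.0], direction, distances)
```

Averaging such curves over several random directions gives a rough, direction-independent picture of how quickly the loss rises around each solution.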
This digest was generated by AI and reviewed by a human editor.