QUICK REVIEW

[Paper Review] On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

Noah Golmant, Nikita Vemuri|arXiv (Cornell University)|Nov 30, 2018

Advanced Neural Network Applications | References: 31 | Citations: 46

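One-Line Summary

This paper shows that increasing the SGD mini-batch size yields progressively smaller improvements in convergence speed and often raises total computational cost. The critical batch size is well below current GPU capacity, and performance degrades at large batch sizes in many domains.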

ABSTRACT

Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique. We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for either train or test loss. This batch size is usually substantially below the capacity of current systems. We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break down depends more on attributes like model architecture and data complexity than it does directly on the size of the dataset.

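Motivation and Objectives

Evaluate the relationship between mini-batch size and SGD convergence speed across diverse architectures and tasks.
Quantify the regimes of batch size scaling: linear gains, diminishing returns, and stagnation.
Assess whether common large-batch optimization tricks mitigate the inefficiency across problems.
Understand how factors beyond dataset size (model architecture, data complexity) affect large-batch performance.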

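Proposed Method

Formulate SGD with mini-batch gradients and define the number of iterations to convergence as a proxy for training time.
Vary the batch size empirically on real data across multiple architectures and tasks (image classification, segmentation, NLP).
Compare a baseline learning rate strategy (BLR), the linear scaling rule (LSR), and the square root scaling rule (SRSR).
Measure convergence speed as the number of iterations needed to reach a fixed loss threshold, and evaluate the effect on generalization.
Analyze how dataset size, model architecture, and data complexity affect the speedup curve and the critical batch size.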
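The three learning rate strategies compared above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the reference batch size `base_batch_size`, and the linear warm-up schedule are assumptions made here for the example. Reading W in the settings table as a warm-up length in epochs is also an assumption.

```python
import math


def scaled_lr(base_lr, batch_size, base_batch_size, rule):
    """Return the target learning rate for `batch_size` under a scaling rule.

    `base_lr` is the rate tuned at `base_batch_size` (a hypothetical
    reference point chosen for this sketch).
    """
    ratio = batch_size / base_batch_size
    if rule == "BLR":   # baseline: learning rate is not rescaled
        return base_lr
    if rule == "LSR":   # linear scaling rule: rate grows with the batch size
        return base_lr * ratio
    if rule == "SRSR":  # square root scaling rule: rate grows with sqrt(batch size)
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


def warmup_lr(epoch, base_lr, target_lr, warmup_epochs):
    """Linearly ramp from base_lr to target_lr over `warmup_epochs`.

    Assumed schedule for illustration; after warm-up the target rate is used.
    """
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs


if __name__ == "__main__":
    # Illustrative values only: base rate 0.1 at batch size 64, scaled to 1024.
    for rule in ("BLR", "LSR", "SRSR"):
        print(rule, scaled_lr(0.1, 1024, 64, rule))
```

With a 16x larger batch, LSR multiplies the rate by 16 and SRSR by 4, which shows why LSR is the more aggressive rule and the more likely of the two to become unstable at very large batch sizes.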

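Experimental Results

Research Questions

RQ1: How does batch size relate to SGD convergence speed across different architectures and tasks?
RQ2: Is there a critical batch size beyond which the number of iterations stops decreasing, and how does it relate to hardware capacity?
RQ3: Do large-batch optimization heuristics (LSR, SRSR) mitigate slow convergence and the generalization gap regardless of the problem?
RQ4: How do model architecture and data complexity compare with dataset size in determining large-batch efficiency?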

Key Findings

Dataset	Task	Architecture	Training Strategy	Batch size range
MNIST	IC	ResNet34	BLR, LSR (η0=0.1, W=10, E=200)	2^6 – 2^14
CIFAR-10	IC	AlexNet, MobileNetV2	BLR, LSR, SRSR	2^6 – 2^14
CIFAR-10	IC	ResNet34, VGG16	BLR, LSR (η0=0.1, W=10, E=200)	2^6 – 2^14
CIFAR-100	IC	ResNet34	BLR, LSR (η0=0.1, W=10, E=200)	2^6 – 2^14
SVHN	IC	ResNet34	BLR, LSR (η0=0.1, W=10, E=200)	2^6 – 2^14
WikiText-2	NLP	LSTM	BLR, LSR (η0=20, W=3, E=40)	2^3 – 2^10
Cityscapes	IS	DRN-D-22	BLR, LSR (η0=0.01, W=10, E=100)	2^3 – 2^11

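Beyond a certain threshold, increasing the batch size m yields almost no further reduction in the number of iterations to convergence (this holds even under perfect parallelization).
Larger batches raise generalization error, and existing mitigation techniques often fail or diverge, particularly in non-image domains.
Gains in convergence speed depend more on model architecture and data complexity than on dataset size, and a problem-dependent critical batch size is observed.
Speedup shows diminishing returns across image classification, segmentation, and NLP tasks, with the plateau point varying by architecture and dataset complexity.
Large-batch strategies that work for a particular problem often fail to generalize across domains or to remain stable.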
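The link between a plateau in iterations and rising total cost can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not the paper's analysis code: `speedup_table` and `critical_batch_size` are hypothetical helpers, total cost is approximated as batch size times iterations (examples processed), and the 10% gain cutoff used to locate the plateau is an arbitrary choice. The iteration counts in the demo are made-up numbers, not results from the paper.

```python
def speedup_table(iters_to_target):
    """Summarize scaling from a dict {batch_size: iterations to reach the loss threshold}."""
    sizes = sorted(iters_to_target)
    base = sizes[0]
    rows = []
    for b in sizes:
        iters = iters_to_target[b]
        speedup = iters_to_target[base] / iters  # reduction in iterations vs. smallest batch
        ideal = b / base                         # what perfect linear scaling would give
        rows.append({
            "batch_size": b,
            "speedup": speedup,
            "efficiency": speedup / ideal,       # 1.0 means linear scaling
            "examples_processed": b * iters,     # proxy for total computational cost
        })
    return rows


def critical_batch_size(iters_to_target, min_gain=0.1):
    """Return the first batch size after which the next larger size cuts
    iterations by less than `min_gain` (heuristic assumed for this sketch)."""
    sizes = sorted(iters_to_target)
    for small, large in zip(sizes, sizes[1:]):
        gain = 1.0 - iters_to_target[large] / iters_to_target[small]
        if gain < min_gain:
            return small
    return sizes[-1]


if __name__ == "__main__":
    # Made-up iteration counts for illustration only.
    demo = {64: 8000, 128: 4000, 256: 2200, 512: 1500, 1024: 1400, 2048: 1380}
    for row in speedup_table(demo):
        print(row)
    print("critical batch size:", critical_batch_size(demo))
```

In the made-up data, iterations halve from batch size 64 to 128 at constant cost, but past 512 the iteration count barely moves while the number of examples processed keeps roughly doubling. That is the pattern the findings describe: once the plateau is reached, a larger batch buys no speedup and only adds computation.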

This review was generated by AI and checked by a human editor.