QUICK REVIEW

[Paper Review] Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful

Martin Marek, Sanae Lotfi|ArXiv.org|Jul 9, 2025

Topic Modeling | Citations: 3

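One-Line Summary

Summary: This paper shows that language models can be trained stably and robustly with small batch sizes, down to batch size one, sometimes matching or outperforming large-batch training and more elaborate optimizers, and it recommends against gradient accumulation unless training across multiple devices.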

ABSTRACT

Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a proportional increase in batch size. While it is common to decrease the learning rate for smaller batch sizes, other hyperparameters are often held fixed. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes. In particular, rather than holding the decay rate of the second moment fixed across batch sizes, we propose to hold its half-life fixed in terms of tokens. We find that small batch sizes (1) train stably, (2) are consistently more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) notably enable stable language model training with vanilla SGD, even without momentum, despite storing no optimizer state. Building on these results, we provide practical recommendations for selecting a batch size and setting optimizer hyperparameters. We further recommend against gradient accumulation unless training on multiple devices with multiple model replicas. Finally, we show that a small batch size combined with an optimizer with a small state size can provide the performance benefits of full fine-tuning while maintaining a similar memory footprint to LoRA.

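Motivation and Objectives

Investigate whether language models can be trained stably with small batch sizes in both pretraining and fine-tuning.
Examine how robust different optimizers and hyperparameter settings are across batch sizes.
Develop practical guidelines for scaling Adam hyperparameters and for choosing a batch size under throughput and memory constraints.
Explore memory and hardware considerations when choosing a small batch size, including a comparison with gradient accumulation.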

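Proposed Method

Run an extensive grid search over SGD, Adam, Adafactor, and Muon with batch sizes from 1 to 4096.
For each batch size, tune the learning rate and Adam's decay rates (beta1, beta2) to minimize validation loss.
Introduce the second-moment half-life t2 and give a rule for scaling beta2 that holds this half-life fixed when measured in tokens.
Compare vanilla SGD (no momentum) with Adafactor, a memory-efficient optimizer, in the small-batch regime.
Validate the scaling heuristics on larger models (GPT-2 124M, GPT-3 1.3B) and in fine-tuning scenarios.
Assess memory implications and provide practical recommendations for memory-constrained training.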
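
The token-based half-life rule can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the example values (sequence length 1024, reference beta2 of 0.95 at batch size 512) are assumptions chosen for illustration. The reasoning is that the second-moment average decays by a factor of beta2 per optimizer step and each step consumes batch_size * seq_len tokens, so holding the half-life t2 fixed in tokens implies beta2 = 0.5 ** (batch_size * seq_len / t2).

```python
import math


def half_life_tokens(beta2: float, batch_size: int, seq_len: int) -> float:
    """Tokens after which a gradient's weight in the second-moment average halves."""
    steps = math.log(0.5) / math.log(beta2)  # n such that beta2 ** n == 0.5
    return steps * batch_size * seq_len


def beta2_for_half_life(t2: float, batch_size: int, seq_len: int) -> float:
    """beta2 that keeps the half-life at t2 tokens for the given batch size."""
    tokens_per_step = batch_size * seq_len
    return 0.5 ** (tokens_per_step / t2)


def rescale_beta2(beta2: float, old_batch: int, new_batch: int) -> float:
    """Equivalent closed form: beta2_new = beta2_old ** (new_batch / old_batch)."""
    return beta2 ** (new_batch / old_batch)


# Hypothetical reference setting: beta2 = 0.95 at batch size 512.
SEQ_LEN = 1024
T2 = half_life_tokens(0.95, batch_size=512, seq_len=SEQ_LEN)
for batch in (512, 64, 1):
    print(batch, beta2_for_half_life(T2, batch, SEQ_LEN))
```

Under these assumed values, beta2 moves much closer to 1 as the batch size shrinks, which is why keeping beta2 fixed across batch sizes shortens the averaging window in tokens.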

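Experimental Results

Research Questions

RQ1: Can language models be trained stably at very small batch sizes (down to 1) without momentum or sophisticated optimizers?
RQ2: How should Adam's hyperparameters be scaled to maintain performance at small batch sizes?
RQ3: Are small batch sizes more robust to hyperparameter misspecification than large batch sizes?
RQ4: What are the memory and hardware implications of small batches versus gradient accumulation in language model training?
RQ5: Do these findings carry over to fine-tuning and to larger model scales?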

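Key Findings

With appropriately scaled hyperparameters, small batch sizes consistently match or outperform large batches across all optimizers tested.
At small batch sizes momentum is often unnecessary, and vanilla SGD is competitive.
Scaling beta2 so that the second-moment half-life stays fixed in tokens performs better across batch sizes than holding beta2 fixed.
Gradient accumulation is often unnecessary; using a simpler optimizer such as Adafactor or SGD with a small batch is more memory-efficient.
On the larger models, SGD with batch size one matches the AdamW baseline with minimal tuning, and Adafactor offers a good trade-off between memory and performance.
The recommended practice is to use the smallest batch size that maximizes throughput and to avoid gradient accumulation except in multi-device setups with multiple model replicas.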
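
To make the memory argument concrete, here is a rough sketch of how much optimizer state each optimizer keeps for a single weight matrix. It is a back-of-the-envelope count under stated assumptions, not a measurement from the paper: Adam is assumed to store a first and a second moment per parameter, vanilla SGD is assumed to store nothing, and Adafactor is assumed to run without momentum and with a factored second moment (one row vector and one column vector). The function name and the 4096 x 4096 example shape are hypothetical.

```python
def optimizer_state_elems(shape: tuple[int, int], optimizer: str) -> int:
    """Number of extra values an optimizer keeps for one weight matrix."""
    rows, cols = shape
    n_params = rows * cols
    if optimizer == "sgd":        # vanilla SGD, no momentum: no state
        return 0
    if optimizer == "adam":       # first and second moment per parameter
        return 2 * n_params
    if optimizer == "adafactor":  # assumed: factored second moment, no momentum
        return rows + cols
    raise ValueError(f"unknown optimizer: {optimizer}")


# Hypothetical 4096 x 4096 projection matrix.
SHAPE = (4096, 4096)
for name in ("sgd", "adafactor", "adam"):
    print(name, optimizer_state_elems(SHAPE, name))
```

The count covers optimizer state only; weights, gradients, and activations are not included.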

This review was generated by AI and checked by a human editor.