QUICK REVIEW

[Paper Review] Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Sergey Ioffe|arXiv (Cornell University)|Feb 10, 2017
Machine Learning and Data Classification|10 references|244 citations
TL;DR

Batch Renormalization extends Batch Normalization to reduce minibatch dependence, enabling stable training with small or non-i.i.d. minibatches while preserving training efficiency and other BN benefits. It introduces a per-dimension affine correction (r, d) that is computed from minibatch and moving-average statistics but treated as constant during backpropagation; the clipping bounds on r and d are gradually relaxed during training.
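
The correction can be written out as a short worked equation. This is a sketch in the notation commonly used for the method (mu_B and sigma_B for minibatch statistics, mu and sigma for moving averages); the clip notation is an assumption made here for compactness.

```latex
\begin{align}
r &= \operatorname{clip}_{[1/r_{\max},\; r_{\max}]}\!\left(\frac{\sigma_B}{\sigma}\right),
\qquad
d = \operatorname{clip}_{[-d_{\max},\; d_{\max}]}\!\left(\frac{\mu_B - \mu}{\sigma}\right) \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sigma_B}\, r + d \\
\text{(no clipping)}\quad
\hat{x}_i &= \frac{x_i - \mu_B}{\sigma_B}\cdot\frac{\sigma_B}{\sigma} + \frac{\mu_B - \mu}{\sigma}
           = \frac{x_i - \mu}{\sigma}
\end{align}
```

When neither bound is active, the training activation reduces to the inference-time normalization by the moving averages. With r_max = 1 and d_max = 0 it reduces to ordinary Batch Normalization.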

ABSTRACT

Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches. At the same time, Batch Renormalization retains the benefits of batchnorm such as insensitivity to initialization and training efficiency.

Motivation & Objective

  • Motivate and address the drawbacks of Batch Normalization when minibatches are small or non-i.i.d.
  • Develop an extension that makes training activations depend on individual examples similar to inference
  • Maintain BN advantages (training speed, initialization robustness) while aligning training and inference activations
  • Provide a practical, easy-to-implement method with tunable correction bounds and moving-average updates

Proposed method

  • Introduce per-dimension correction factors r and d to Batch Normalization activations, treated as constants during gradient computation
  • Compute r and d from minibatch and moving-average statistics, clip r to [1/r_max, r_max] and d to [-d_max, d_max], and apply stop_gradient to the results
  • Use moving averages mu and sigma during training for correction, with a higher update rate alpha to keep statistics current
  • Gradually relax the correction bounds during training to transition from BN to Renorm
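  • Provide explicit backpropagation equations for the normalized activation, the minibatch mean and standard deviation, the input x, and the parameters gamma and beta, with r and d treated as constants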
  • Outline an algorithm that updates mu and sigma and applies the renormalization in forward and backward passes
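
The steps above can be sketched as a NumPy forward pass. This is a minimal illustration, not the author's implementation: the function names `batch_renorm_train` and `batch_renorm_infer`, the `(batch, features)` layout, and the default values of `alpha` and `eps` are assumptions. NumPy has no automatic differentiation, so the stop_gradient step appears only as a comment.

```python
import numpy as np


def batch_renorm_train(x, gamma, beta, mu, sigma, r_max, d_max,
                       alpha=0.01, eps=1e-5):
    """Training-mode forward pass for one layer (hypothetical helper).

    x has shape (batch, features); mu and sigma are the moving averages
    with shape (features,). Returns (y, new_mu, new_sigma).
    """
    # Minibatch statistics, per feature dimension.
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)

    # Correction factors. In an autodiff framework these two values would be
    # wrapped in stop_gradient so they act as constants in the backward pass.
    r = np.clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = np.clip((mu_b - mu) / sigma, -d_max, d_max)

    # Normalize with minibatch statistics, then apply the affine correction.
    x_hat = (x - mu_b) / sigma_b * r + d
    y = gamma * x_hat + beta

    # Move the running statistics toward the minibatch statistics.
    new_mu = mu + alpha * (mu_b - mu)
    new_sigma = sigma + alpha * (sigma_b - sigma)
    return y, new_mu, new_sigma


def batch_renorm_infer(x, gamma, beta, mu, sigma):
    """Inference-mode pass: uses only the moving averages."""
    return gamma * (x - mu) / sigma + beta
```

Setting `r_max=1.0` and `d_max=0.0` makes the training pass identical to plain Batch Normalization, while very loose bounds make it match `batch_renorm_infer` on the same inputs.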

Experimental results

Research questions

  • RQ1: Can Batch Renormalization reduce the mismatch between training and inference activations observed with Batch Normalization on small or non-i.i.d. minibatches?
  • RQ2: Does Batch Renormalization retain Batch Normalization benefits (training speed, initialization insensitivity) while improving performance on challenging minibatch regimes?
  • RQ3: How should the correction bounds (r_max, d_max) and moving-average update rate (alpha) be scheduled for stable training?
  • RQ4: Is Batch Renormalization effective across architectures and tasks where BN is typically used (e.g., image classification with Inception-v3)?

Key findings

  • Batch Renormalization achieves comparable or modestly higher validation accuracy than Batch Normalization on ImageNet with Inception-v3 when using minibatch size 32 across 50 workers (78.3% baseline BN vs 78.5% with Renorm)
  • With microbatches of 4 (small minibatches), Batch Renorm trains faster and attains higher accuracy (76.5% at 130k steps) than BatchNorm (74.2% at 210k steps)
  • On non-i.i.d. minibatches sampled by label, BatchNorm's accuracy degrades sharply, while Batch Renorm recovers baseline-level accuracy (78.5% at 120k steps)
  • Batch Renormalization eliminates the overfitting to biased minibatch distributions seen with BatchNorm in metric-learning-like minibatch setups
  • The method remains easy to implement, runs with similar speed to BN, and introduces hyperparameters (alpha, r_max, d_max) with a schedule for relaxing correction during training
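
A schedule for relaxing the correction bounds might look like the sketch below. The function name `renorm_bounds` and the linear ramp are assumptions. The default breakpoints are illustrative values in the spirit of the paper's experiments and should be checked against the original before use.

```python
def renorm_bounds(step, warmup=5_000, r_end_step=40_000, d_end_step=25_000,
                  r_final=3.0, d_final=5.0):
    """Return (r_max, d_max) for a training step (hypothetical helper).

    During warmup the bounds are r_max = 1 and d_max = 0, which is plain
    Batch Normalization. Afterwards each bound ramps linearly (an assumed
    shape) to its final value.
    """
    def ramp(end_step, start_value, final_value):
        if step <= warmup:
            return start_value
        if step >= end_step:
            return final_value
        frac = (step - warmup) / (end_step - warmup)
        return start_value + frac * (final_value - start_value)

    return ramp(r_end_step, 1.0, r_final), ramp(d_end_step, 0.0, d_final)
```

Starting from the Batch Normalization setting and loosening the bounds later keeps early training stable while the moving averages are still inaccurate.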

This review was created by AI and reviewed by human editors.