QUICK REVIEW

[Paper Review] Adaptively Truncating Backpropagation Through Time to Control Gradient Bias

Christopher Aicher, Nicholas J. Foti, Emily B. Fox | arXiv (Cornell University) | May 17, 2019

Sparse and Compressive Sensing Techniques | 22 references | 21 citations

TL;DR

This paper proposes an adaptive truncation scheme for truncated backpropagation through time (TBPTT) in recurrent neural networks. Instead of using a fixed lag, it adjusts the truncation length during training based on an estimate of the gradient bias. Assuming gradients decay geometrically in expectation at large lags, the method keeps the relative bias below a user-chosen level, which yields a non-asymptotic convergence guarantee for SGD. On language modeling it matches or outperforms the best fixed-K TBPTT while keeping the bias under control.

ABSTRACT

Truncated backpropagation through time (TBPTT) is a popular method for learning in recurrent neural networks (RNNs) that saves computation and memory at the cost of bias by truncating backpropagation after a fixed number of lags. In practice, choosing the optimal truncation length is difficult: TBPTT will not converge if the truncation length is too small, or will converge slowly if it is too large. We propose an adaptive TBPTT scheme that converts the problem from choosing a temporal lag to one of choosing a tolerable amount of gradient bias. For many realistic RNNs, the TBPTT gradients decay geometrically in expectation for large lags; under this condition, we can control the bias by varying the truncation length adaptively. For RNNs with smooth activation functions, we prove that this bias controls the convergence rate of SGD with biased gradients for our non-convex loss. Using this theory, we develop a practical method for adaptively estimating the truncation length during training. We evaluate our adaptive TBPTT method on synthetic data and language modeling tasks and find that our adaptive TBPTT ameliorates the computational pitfalls of fixed TBPTT.
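
The bias that truncation introduces can be seen on a toy model. The sketch below is not from the paper: it assumes a scalar linear RNN h_t = w * h_{t-1} + x_t with a squared loss on the final state, and the function names (rnn_states, truncated_grad, relative_bias) are hypothetical. It compares the gradient obtained by backpropagating only K steps with the full gradient.

```python
import numpy as np

def rnn_states(w, xs):
    """Run the scalar RNN h_t = w * h_{t-1} + x_t from h_0 = 0; return h_0..h_T."""
    h = 0.0
    hs = [h]
    for x in xs:
        h = w * h + x
        hs.append(h)
    return np.array(hs)

def truncated_grad(w, xs, y, K):
    """Gradient of 0.5 * (h_T - y)**2 w.r.t. w, backpropagating at most K steps.

    The full derivative is dh_T/dw = sum_{k=0}^{T-1} w**k * h_{T-1-k};
    truncation keeps only the first K terms of that sum.
    """
    hs = rnn_states(w, xs)
    T = len(xs)
    err = hs[T] - y
    K = min(K, T)
    return err * sum(w ** k * hs[T - 1 - k] for k in range(K))

def relative_bias(w, xs, y, K):
    """|truncated gradient - full gradient| / |full gradient|."""
    full = truncated_grad(w, xs, y, len(xs))
    return abs(truncated_grad(w, xs, y, K) - full) / abs(full)

if __name__ == "__main__":
    xs = np.ones(30)
    for K in (1, 2, 5, 10, 30):
        print(K, relative_bias(0.5, xs, 0.0, K))
```

In this toy setting with |w| < 1 the dropped terms shrink geometrically with the lag, so the relative bias falls quickly as K grows and is exactly zero once K reaches the sequence length.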

Motivation & Objective

To address the challenge of selecting a fixed truncation length in TBPTT: too short a length can prevent convergence because of gradient bias, while too long a length slows training through wasted computation.
To formalize a condition under which gradient bias in TBPTT decays geometrically, enabling bias control through adaptive truncation.
To develop a practical algorithm that estimates gradient bias in real time during training and adjusts truncation length accordingly.
To prove non-asymptotic convergence rates for SGD when using biased gradients under bounded relative bias.
To empirically validate the method on synthetic tasks and language modeling benchmarks, showing competitive performance with bias control.

Proposed method

Proposes a theoretical framework where gradient norms decay geometrically in expectation beyond a certain lag, enabling bias control.
Introduces a relative bias measure δ, the norm of the gradient bias (truncated minus exact gradient) divided by the norm of the exact gradient, with δ < 1 ensuring convergence.
Develops an estimator for the relative bias δ using minibatch gradients during training, enabling real-time adaptation.
Designs an adaptive TBPTT algorithm (Algorithm 1) that adjusts truncation length K based on estimated δ and user-defined target bias levels.
Notes that a Mahalanobis-type or otherwise weighted norm could improve bias estimation for high-dimensional hidden states, but leaves this for future work.
Applies the method to both synthetic copy tasks and real-world language modeling on Penn Treebank (PTB) and WikiText-2 (Wiki2), using LSTMs with fixed hyperparameters.
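
The adaptive idea in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: it assumes the per-lag gradient contributions are available as rows of an array, that their norms decay geometrically beyond a user-supplied lag (tail_start), and it bounds the relative bias with the triangle inequality. All function and parameter names are hypothetical.

```python
import numpy as np

def estimate_decay_rate(lag_norms, tail_start):
    """Least-squares fit of log(norm) against lag on lags >= tail_start.

    Returns beta such that norm_k is roughly proportional to beta**k.
    """
    k = np.arange(tail_start, len(lag_norms))
    slope, _ = np.polyfit(k, np.log(lag_norms[tail_start:]), 1)
    return float(np.exp(slope))

def relative_bias_bound(lag_grads, K, beta):
    """Upper bound on ||bias|| / ||exact gradient|| when truncating after lag K.

    The dropped terms are bounded by the observed norms of lags K+1..R plus a
    geometric-series bound for the unobserved lags beyond R. The exact gradient
    norm is bounded below by ||truncated gradient|| - ||bias||.
    """
    lag_norms = np.linalg.norm(lag_grads, axis=1)
    tail = lag_norms[K + 1:].sum() + lag_norms[-1] * beta / (1.0 - beta)
    trunc_norm = np.linalg.norm(lag_grads[:K + 1].sum(axis=0))
    if trunc_norm <= tail:
        return np.inf  # the bound is vacuous for this K
    return tail / (trunc_norm - tail)

def choose_truncation(lag_grads, delta_target, tail_start):
    """Smallest lag K whose relative-bias bound is at most delta_target."""
    lag_norms = np.linalg.norm(lag_grads, axis=1)
    beta = estimate_decay_rate(lag_norms, tail_start)
    max_lag = len(lag_grads) - 1
    if beta >= 1.0:
        return max_lag  # no decay detected, fall back to the longest lag
    for K in range(max_lag + 1):
        if relative_bias_bound(lag_grads, K, beta) <= delta_target:
            return K
    return max_lag
```

The design choice here is to trade a tuning knob with no intrinsic meaning (the lag K) for one that does (the tolerated relative bias delta_target). In a training loop, a routine like choose_truncation would be called periodically on minibatch gradient estimates and its result used as the truncation length for the following updates.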

Experimental results

Research questions

RQ1: Can gradient bias in TBPTT be controlled by adapting the truncation length based on estimated bias rather than using a fixed lag?
RQ2: Under what conditions does the gradient norm decay geometrically in expectation, enabling bias control in TBPTT?
RQ3: Does adaptive truncation based on relative bias estimation lead to faster convergence and better performance than fixed truncation in RNN training?
RQ4: Can non-asymptotic convergence guarantees be established for SGD when using biased gradients under bounded relative bias?
RQ5: How does the method perform in practice on real-world language modeling tasks compared to optimal fixed-K TBPTT?
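
The intuition behind the convergence question (RQ4) can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: it runs plain gradient descent on the quadratic 0.5 * ||θ||², uses a worst-case bias that points against the exact gradient with norm δ times the gradient norm, and counts the steps needed to reach a tolerance. The function name steps_to_converge and its parameters are hypothetical.

```python
import numpy as np

def steps_to_converge(delta, lr=0.01, tol=1e-3, max_steps=100_000):
    """Gradient descent on 0.5 * ||theta||^2 with relative bias delta.

    The bias is the worst case for progress: it opposes the exact gradient
    and has norm delta * ||gradient||. Returns the number of steps taken
    until ||theta|| < tol (or max_steps if that never happens).
    """
    theta = np.ones(5) / np.sqrt(5.0)  # start at unit norm
    for step in range(1, max_steps + 1):
        grad = theta                    # exact gradient of the quadratic
        biased = grad - delta * grad    # biased gradient, still a descent direction
        theta = theta - lr * biased
        if np.linalg.norm(theta) < tol:
            return step
    return max_steps

if __name__ == "__main__":
    base = steps_to_converge(0.0)
    for delta in (0.0, 0.25, 0.5, 0.75):
        print(delta, steps_to_converge(delta) / base)
```

For small learning rates the step count in this toy grows roughly like 1 / (1 − δ) relative to the unbiased run, so δ = 0.5 takes about twice as long. With δ ≥ 1 the update no longer makes progress, which is why the bound δ < 1 matters.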

Key findings

The proposed adaptive TBPTT method controls gradient bias effectively, while fixed-K TBPTT fails to maintain bias control across training.
On both synthetic copy tasks and language modeling (PTB and Wiki2), the adaptive method achieves test perplexity comparable to or better than the best fixed-K TBPTT configurations.
The estimated truncation length K stabilizes quickly to a constant value during training, indicating effective adaptation.
Empirical results confirm that gradient norms decay geometrically in expectation (as assumed), even when individual gradients are noisy.
In high-dimensional settings, the Euclidean norm can lead to overly conservative bias estimates; future work should consider dimensionally weighted norms like Mahalanobis.
Theoretical analysis shows that when δ < 1, SGD with biased gradients still converges, with a bound that is a factor of (1−δ)^{-1} worse than that of unbiased SGD.
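
The factor 1/(1−δ) in the last finding can be motivated by a short calculation. The notation below is chosen here for illustration and may differ from the paper's: g is the exact gradient, \hat g_K the K-step truncated gradient, \varphi_k the expected norm of the lag-k gradient term, \beta the decay rate, and \tau the lag at which geometric decay is assumed to start.

```latex
\begin{align*}
% Bounded relative bias (assumption)
\bigl\| \mathbb{E}[\hat g_K] - g \bigr\| &\le \delta \, \| g \|, \qquad \delta < 1 . \\[4pt]
% Cauchy-Schwarz: the expected update keeps a (1 - delta) share of the descent
\bigl\langle g, \mathbb{E}[\hat g_K] \bigr\rangle
  &= \| g \|^2 + \bigl\langle g, \mathbb{E}[\hat g_K] - g \bigr\rangle \\
  &\ge \| g \|^2 - \| g \| \, \bigl\| \mathbb{E}[\hat g_K] - g \bigr\| \\
  &\ge (1 - \delta) \, \| g \|^2 . \\[4pt]
% Geometric decay of the lag terms bounds the bias for K >= tau - 1
\varphi_k \le \varphi_\tau \, \beta^{\,k - \tau} \ \ (k \ge \tau,\ 0 < \beta < 1)
  \;&\Longrightarrow\;
  \bigl\| \mathbb{E}[\hat g_K] - g \bigr\|
  \le \sum_{k > K} \varphi_k
  \le \frac{\varphi_\tau \, \beta^{\,K + 1 - \tau}}{1 - \beta} .
\end{align*}
```

The middle chain shows that each step retains at least a (1 − δ) share of the progress an unbiased step would make in expectation, which is where a slowdown of 1/(1 − δ) comes from. The last line shows why increasing K drives the bias, and hence δ, down geometrically once K passes τ.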

This review was created by AI and reviewed by human editors.