[Paper Review] Dynamic Evaluation of Neural Sequence Models
Dynamic evaluation adaptively updates model parameters at test time using gradient-based updates on recent history, yielding state-of-the-art perplexities and cross-entropies across multiple language modelling benchmarks.
We present methodology for using dynamic evaluation to improve neural sequence models. Models are adapted to recent history via a gradient descent based mechanism, causing them to assign higher probabilities to re-occurring sequential patterns. Dynamic evaluation outperforms existing adaptation approaches in our comparisons. Dynamic evaluation improves the state-of-the-art word-level perplexities on the Penn Treebank and WikiText-2 datasets to 51.1 and 44.3 respectively, and the state-of-the-art character-level cross-entropies on the text8 and Hutter Prize datasets to 1.19 bits/char and 1.08 bits/char respectively.
Motivation & Objective
- Motivate and develop a gradient-based test-time adaptation mechanism to capture local distribution shifts in sequences.
- Demonstrate that adapting to recent history improves predictive performance over static models and prior adaptation approaches.
- Evaluate the method on word-level and character-level language modelling benchmarks and analyze time-scale effects.
- Propose practical improvements to dynamic evaluation to reduce adaptation parameters and computation.
Proposed method
- Divide long test sequences into segments and compute gradients on each segment to update adapted parameters.
- Initialize adapted parameters theta_l^0 with trained global parameters theta_g.
- Apply a gradient-based update using the segment loss L(s_i) to obtain theta_l^i before the next segment.
- Introduce a global decay prior lambda*(theta_g - theta_l^{i-1}) to bias adaptation toward training-time parameters.
- Replace SGD with RMSprop-style updates using a precomputed MS_g (mean squared gradients) from training data to scale per-parameter updates.
- Implement sparse dynamic evaluation by learning a small adaptation matrix M that perturbs hidden states (h'_t = h_t + M h_t) to reduce adaptation parameter count.
- Provide and compare multiple update rules, with RMS + RMS global prior performing best in experiments.
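The update loop described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation. It assumes a unigram softmax model in place of a recurrent network, and the names `dynamic_eval` and `segment_loss_and_grad` and the hyperparameters `seg_len`, `lr`, and `decay` are hypothetical. The global prior is applied without the per-parameter scaling of the "RMS global prior" variant.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def segment_loss_and_grad(theta, segment):
    """Mean NLL (nats) of a unigram softmax model on one segment, and its gradient."""
    p = softmax(theta)
    loss = -np.log(p[segment]).mean()
    counts = np.bincount(segment, minlength=theta.size)
    grad = p - counts / len(segment)  # d(mean NLL)/d(logits)
    return loss, grad

def dynamic_eval(theta_g, ms_g, tokens, seg_len=5, lr=0.1, decay=0.02, eps=1e-8):
    """Score each segment with the current adapted parameters, then adapt."""
    theta_l = theta_g.copy()  # theta_l^0 = theta_g
    total_nll = 0.0
    for start in range(0, len(tokens), seg_len):
        seg = tokens[start:start + seg_len]
        # The segment is scored BEFORE the update, so no token is seen early.
        loss, grad = segment_loss_and_grad(theta_l, seg)
        total_nll += loss * len(seg)
        # RMSprop-style step scaled by the precomputed MS_g, plus a pull
        # back toward the trained parameters (global decay prior).
        theta_l = (theta_l
                   - lr * grad / (np.sqrt(ms_g) + eps)
                   + decay * (theta_g - theta_l))
    return total_nll / len(tokens)

# Toy setup (assumption): the "trained" model is uniform over 4 symbols.
vocab = 4
theta_g = np.zeros(vocab)
rng = np.random.default_rng(0)
train = rng.integers(0, vocab, size=1000).reshape(-1, 5)
# MS_g: mean squared gradient per parameter, collected on training segments.
ms_g = np.mean([segment_loss_and_grad(theta_g, s)[1] ** 2 for s in train], axis=0)

# Test sequence with a strongly re-occurring local pattern.
test_tokens = np.array([2, 2, 1, 2, 2] * 20)
static_nll = dynamic_eval(theta_g, ms_g, test_tokens, lr=0.0, decay=0.0)
dynamic_nll = dynamic_eval(theta_g, ms_g, test_tokens)
print(static_nll, dynamic_nll)
```

Setting `lr=0.0` and `decay=0.0` recovers static evaluation, which makes the two modes directly comparable on the same sequence. In this toy setting the adapted model assigns higher probability to the repeated symbols after the first few segments.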
Experimental results
Research questions
- RQ1: Does dynamic evaluation improve language modelling performance over static evaluation and previous adaptation approaches?
- RQ2: What are effective update rules (SGD vs RMSprop, with/without global prior) for dynamic evaluation across word- and character-level tasks?
- RQ3: How does dynamic evaluation perform across different time-scales and distribution shifts?
- RQ4: Can adaptation be made computationally efficient (e.g., via sparse dynamic evaluation) without sacrificing performance?
Key findings
- On PTB, dynamic evaluation of the AWD-LSTM reaches a perplexity of 51.6 on the validation set and 51.1 on the test set, outperforming the neural cache applied to the same model.
- On WikiText-2, dynamic evaluation of the AWD-LSTM achieves a test perplexity of 44.3, again ahead of the neural cache on the same model.
- Character-level results show dynamic evaluation attaining 1.19 bits/char on text8 and 1.08 bits/char on the Hutter Prize dataset, with sparse dynamic evaluation reaching 1.13 bits/char on Hutter Prize.
- Sparse dynamic evaluation adapts only 0.5% as many parameters as full dynamic evaluation yet still yields substantial gains over static evaluation.
- Dynamic evaluation demonstrates notable gains after processing a few hundred characters and can maintain improvements as sequences continue, especially under cross-domain shifts (e.g., Spanish data).
- Generated conditional samples from dynamically evaluated models reflect longer-range repetition and local regularities learned during adaptation.
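The two metrics reported above are both transformations of the mean negative log-likelihood per predicted symbol. As a brief sketch (the function names are hypothetical, and the loss is assumed to be measured in nats), word-level perplexity and character-level bits/char can be computed as:

```python
import math

def perplexity(mean_nll_nats):
    """Perplexity is the exponential of the mean NLL per word (in nats)."""
    return math.exp(mean_nll_nats)

def bits_per_char(mean_nll_nats):
    """Bits/char is the mean NLL per character converted from nats to bits."""
    return mean_nll_nats / math.log(2)

# A perplexity of 51.1 corresponds to a mean NLL of ln(51.1) nats per word.
word_nll = math.log(51.1)
# 1.08 bits/char corresponds to 1.08 * ln(2) nats per character.
char_nll = 1.08 * math.log(2)
print(perplexity(word_nll), bits_per_char(char_nll))
```

Lower is better for both metrics, and the two are not directly comparable because they are normalized per word and per character respectively.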
This review was created by AI and reviewed by human editors.