QUICK REVIEW

[Paper Review] Dynamic Evaluation of Neural Sequence Models

Ben Krause, Emmanuel Kahembwe | arXiv (Cornell University) | Sep 21, 2017

Topic Modeling | 32 references | 60 citations

TL;DR

Dynamic evaluation adaptively updates model parameters at test time using gradient-based updates on recent history, yielding state-of-the-art perplexities and cross-entropies across multiple language modelling benchmarks.

ABSTRACT

We present methodology for using dynamic evaluation to improve neural sequence models. Models are adapted to recent history via a gradient descent based mechanism, causing them to assign higher probabilities to re-occurring sequential patterns. Dynamic evaluation outperforms existing adaptation approaches in our comparisons. Dynamic evaluation improves the state-of-the-art word-level perplexities on the Penn Treebank and WikiText-2 datasets to 51.1 and 44.3 respectively, and the state-of-the-art character-level cross-entropies on the text8 and Hutter Prize datasets to 1.19 bits/char and 1.08 bits/char respectively.

Motivation & Objective

Motivate and develop a gradient-based test-time adaptation mechanism to capture local distribution shifts in sequences.
Demonstrate that adapting to recent history improves predictive performance over static models and prior adaptation approaches.
Evaluate the method on word-level and character-level language modelling benchmarks and analyze time-scale effects.
Propose practical improvements to dynamic evaluation to reduce adaptation parameters and computation.

Proposed method

Divide long test sequences into segments and compute gradients on each segment to update adapted parameters.
Initialize adapted parameters theta_l^0 with trained global parameters theta_g.
Apply a gradient-based update using the segment loss L(s_i) to obtain theta_l^i before the next segment.
Introduce a global decay prior lambda*(theta_g - theta_l^{i-1}) to bias adaptation toward training-time parameters.
Replace SGD with RMSprop-style updates using a precomputed MS_g (mean squared gradients) from training data to scale per-parameter updates.
Implement sparse dynamic evaluation by learning a small adaptation matrix M that perturbs hidden states (h'_t = h_t + M h_t) to reduce adaptation parameter count.
Provide and compare multiple update rules, with RMS + RMS global prior performing best in experiments.
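
The update loop described above can be sketched on a toy problem. The example below is a minimal illustration, not the authors' implementation: it swaps the LSTM for a unigram softmax model whose only parameters are the logits, and the names `dynamic_eval`, `segment_loss_and_grad`, `seg_len`, `lr`, `decay`, and `ms_g` are hypothetical. Each segment is scored with the current adapted parameters before they are updated, the decay term pulls the adapted parameters back toward theta_g, and the optional `ms_g` argument applies an RMSprop-style per-parameter scaling (how the paper combines this scaling with the prior is not reproduced here).

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def segment_loss_and_grad(theta, segment):
    """Mean negative log-likelihood (nats) of a unigram softmax model, and its gradient."""
    logp = log_softmax(theta)
    counts = np.bincount(segment, minlength=theta.size)
    loss = -(counts @ logp) / len(segment)
    grad = np.exp(logp) - counts / len(segment)  # softmax minus empirical frequencies
    return loss, grad

def dynamic_eval(tokens, theta_g, seg_len=5, lr=0.5, decay=0.01, ms_g=None, eps=1e-8):
    """Average per-token loss when parameters are adapted segment by segment."""
    theta_l = theta_g.copy()          # theta_l^0 = theta_g
    total_nll = 0.0
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        # Score the segment BEFORE adapting on it, so no future data leaks in.
        loss, grad = segment_loss_and_grad(theta_l, segment)
        total_nll += loss * len(segment)
        if ms_g is not None:
            # Assumed RMSprop-style scaling with precomputed mean squared gradients.
            grad = grad / (np.sqrt(ms_g) + eps)
        # Gradient step plus decay toward the trained (global) parameters.
        theta_l = theta_l - lr * grad + decay * (theta_g - theta_l)
    return total_nll / len(tokens)

# Toy data: a 4-symbol vocabulary where one symbol keeps recurring.
tokens = np.array([2] * 40)
theta_g = np.zeros(4)                 # "trained" model is uniform over symbols

static_nll = dynamic_eval(tokens, theta_g, lr=0.0, decay=0.0)  # no adaptation
dynamic_nll = dynamic_eval(tokens, theta_g)
print(static_nll, dynamic_nll)
```

Setting `lr=0.0` recovers static evaluation, which makes the two numbers directly comparable: on this repetitive toy sequence the adapted model assigns the recurring symbol a higher probability after the first segment, so its average loss is lower.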

Experimental results

Research questions

RQ1: Does dynamic evaluation improve language modelling performance over static evaluation and previous adaptation approaches?
RQ2: What are effective update rules (SGD vs RMSprop, with/without global prior) for dynamic evaluation across word- and character-level tasks?
RQ3: How does dynamic evaluation perform across different time-scales and distribution shifts?
RQ4: Can adaptation be made computationally efficient (e.g., via sparse dynamic evaluation) without sacrificing performance?

Key findings

Applied to the AWD-LSTM baseline, dynamic evaluation reaches a PTB perplexity of 51.6 on validation and 51.1 on test, outperforming the neural cache in these setups.
On WikiText-2, dynamic evaluation achieves 44.3 perplexity, significantly better than related adaptive methods.
Character-level results show dynamic evaluation attaining 1.19 bits/char on text8 and 1.08 bits/char on the Hutter Prize dataset.
Sparse dynamic evaluation adapts only about 0.5% as many parameters as full dynamic evaluation yet still yields substantial gains, reaching 1.13 bits/char on Hutter Prize.
Dynamic evaluation demonstrates notable gains after processing a few hundred characters and can maintain improvements as sequences continue, especially under cross-domain shifts (e.g., Spanish data).
Generated conditional samples from dynamically evaluated models reflect longer-range repetition and local regularities learned during adaptation.
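
The sparse variant mentioned in the findings adapts only a matrix M acting on the hidden state, h'_t = h_t + M h_t, instead of all network weights. The sketch below illustrates that idea under several assumptions that are not taken from the review: a fixed random readout matrix `W` stands in for the output layer, the sizes are arbitrary, M starts at zero so the adapted model initially matches the static one, and the helper names `adapt_hidden` and `loss_and_grad_M` are hypothetical.

```python
import numpy as np

def adapt_hidden(h, M):
    """Return the perturbed hidden state h' = h + M h."""
    return h + M @ h

def loss_and_grad_M(M, h, W, target):
    """Cross-entropy of a softmax readout on h', and its gradient w.r.t. M only."""
    h_prime = adapt_hidden(h, M)
    logits = W @ h_prime
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[target])
    d_logits = probs.copy()
    d_logits[target] -= 1.0
    # Chain rule: dL/dh' = W^T d_logits, and h' = h + M h gives dL/dM = (dL/dh') h^T.
    grad_M = np.outer(W.T @ d_logits, h)
    return loss, grad_M

hidden_size, vocab_size = 8, 5                 # hypothetical sizes
rng = np.random.default_rng(0)
h = rng.standard_normal(hidden_size)           # stand-in for a recurrent hidden state
W = 0.1 * rng.standard_normal((vocab_size, hidden_size))  # frozen readout weights
M = np.zeros((hidden_size, hidden_size))       # zero init: h' equals h at the start

loss_before, grad_M = loss_and_grad_M(M, h, W, target=3)
M = M - 0.1 * grad_M                           # only M is adapted; W stays fixed
loss_after, _ = loss_and_grad_M(M, h, W, target=3)
print(loss_before, loss_after, M.size)
```

Here the adapted parameter count is `hidden_size ** 2`, independent of the size of the rest of the network, which is the source of the saving; the specific 0.5% figure depends on the model used in the paper and is not reproduced by this toy.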

This review was created by AI and reviewed by human editors.