QUICK REVIEW

[Paper Review] Dynamic Evaluation of Neural Sequence Models

Ben Krause, Emmanuel Kahembwe|arXiv (Cornell University)|Sep 21, 2017

Topic Modeling|References 32|Cited by 60

One-Sentence Summary

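Dynamic evaluation adapts model parameters at test time using gradient updates computed on recent history, achieving state-of-the-art perplexity and cross-entropy on several language modelling benchmarks.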

ABSTRACT

We present methodology for using dynamic evaluation to improve neural sequence models. Models are adapted to recent history via a gradient descent based mechanism, causing them to assign higher probabilities to re-occurring sequential patterns. Dynamic evaluation outperforms existing adaptation approaches in our comparisons. Dynamic evaluation improves the state-of-the-art word-level perplexities on the Penn Treebank and WikiText-2 datasets to 51.1 and 44.3 respectively, and the state-of-the-art character-level cross-entropies on the text8 and Hutter Prize datasets to 1.19 bits/char and 1.08 bits/char respectively.

Motivation and Objectives

Motivate and develop a gradient-based test-time adaptation mechanism to capture local distribution shifts in sequences.
Demonstrate that adapting to recent history improves predictive performance over static models and prior adaptation approaches.
Evaluate the method on word-level and character-level language modelling benchmarks and analyze time-scale effects.
Propose practical improvements to dynamic evaluation to reduce adaptation parameters and computation.

Proposed Method

Divide long test sequences into segments and compute gradients on each segment to update adapted parameters.
Initialize adapted parameters theta_l^0 with trained global parameters theta_g.
Apply a gradient-based update using the segment loss L(s_i) to obtain theta_l^i before the next segment.
Introduce a global decay prior lambda*(theta_g - theta_l^{i-1}) to bias adaptation toward training-time parameters.
Replace SGD with RMSprop-style updates using a precomputed MS_g (mean squared gradients) from training data to scale per-parameter updates.
Implement sparse dynamic evaluation by learning a small adaptation matrix M that perturbs hidden states (h'_t = h_t + M h_t) to reduce adaptation parameter count.
Provide and compare multiple update rules, with RMS + RMS global prior performing best in experiments.
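
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the "model" is a toy unigram softmax whose logits play the role of theta, `ms_g` is a placeholder for the precomputed mean squared gradients, and the function names (`segment_loss_and_grad`, `dynamic_eval`) are hypothetical. The global decay term is applied unscaled here; the best-performing variant in the summary (RMS + RMS global prior) may scale it differently.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def segment_loss_and_grad(theta, segment):
    """Mean cross-entropy (in nats) of a toy unigram model on one segment.

    theta holds one logit per vocabulary item; it stands in for a real
    sequence model purely for illustration.
    """
    probs = softmax(theta)
    counts = np.bincount(segment, minlength=theta.shape[0])
    loss = -np.log(probs[segment]).mean()
    grad = probs - counts / len(segment)  # gradient of softmax cross-entropy
    return loss, grad


def dynamic_eval(theta_g, tokens, seg_len, lr, decay, ms_g, eps=1e-8):
    """Return mean per-token loss while adapting theta_l segment by segment."""
    theta_l = theta_g.copy()  # theta_l^0 = theta_g
    total_loss, n_tokens = 0.0, 0
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        # Score the segment before updating on it, so no token is predicted
        # with parameters that have already seen it.
        loss, grad = segment_loss_and_grad(theta_l, segment)
        total_loss += loss * len(segment)
        n_tokens += len(segment)
        # RMS-scaled gradient step plus a pull back toward theta_g.
        step = lr * grad / (np.sqrt(ms_g) + eps)
        theta_l = theta_l - step + decay * (theta_g - theta_l)
    return total_loss / n_tokens


theta_g = np.zeros(4)        # hypothetical "trained" parameters (uniform model)
ms_g = np.ones(4)            # placeholder for mean squared training gradients
tokens = np.array([2] * 20)  # a strongly repetitive test sequence
static = dynamic_eval(theta_g, tokens, seg_len=5, lr=0.0, decay=0.0, ms_g=ms_g)
adapted = dynamic_eval(theta_g, tokens, seg_len=5, lr=1.0, decay=0.01, ms_g=ms_g)
print(f"static: {static:.3f} nats/token, adapted: {adapted:.3f} nats/token")
```

Setting `lr=0.0` and `decay=0.0` recovers static evaluation, which makes the two runs directly comparable. On a repetitive sequence like this one, the adapted loss should come out lower because the recurring token receives more probability after each segment.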

Experimental Results

Research Questions

RQ1: Does dynamic evaluation improve language modelling performance over static evaluation and previous adaptation approaches?
RQ2: What are effective update rules (SGD vs RMSprop, with/without global prior) for dynamic evaluation across word- and character-level tasks?
RQ3: How does dynamic evaluation perform across different time-scales and distribution shifts?
RQ4: Can adaptation be made computationally efficient (e.g., via sparse dynamic evaluation) without sacrificing performance?

Key Findings

Applied to the AWD-LSTM baseline, dynamic evaluation improves PTB perplexity to 51.1 on the test set (51.6 on validation), outperforming the neural cache in the same setup.
On WikiText-2, dynamic evaluation achieves 44.3 perplexity, significantly better than related adaptive methods.
Character-level results show dynamic evaluation attaining 1.19 bits/char on text8 and 1.08 bits/char on the Hutter Prize dataset, with sparse dynamic evaluation reaching 1.13 bits/char on Hutter Prize.
Sparse dynamic evaluation uses only about 0.5% as many adaptation parameters as full dynamic evaluation, yet still yields substantial gains over static evaluation.
Dynamic evaluation demonstrates notable gains after processing a few hundred characters and can maintain improvements as sequences continue, especially under cross-domain shifts (e.g., Spanish data).
Generated conditional samples from dynamically evaluated models reflect longer-range repetition and local regularities learned during adaptation.
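
The sparse variant referred to in the findings adapts only a small matrix M that perturbs hidden states via h'_t = h_t + M h_t. The sketch below shows that perturbation and one plain SGD step on M. It is illustrative only: the hidden size, the zero initialisation of M, the learning rate, and the helper names are assumptions rather than details confirmed by this summary, and the upstream gradient `g` would come from backpropagation through the rest of the model.

```python
import numpy as np


def adapt_hidden(h, M):
    """Perturb a hidden state with the adaptation matrix: h' = h + M h."""
    return h + M @ h


def grad_wrt_M(g, h):
    """Gradient of the loss w.r.t. M, given g = d(loss)/d(h').

    Since h'_i = h_i + sum_j M_ij h_j, d(loss)/d(M_ij) = g_i * h_j.
    """
    return np.outer(g, h)


def sgd_step_on_M(M, g, h, lr):
    """One plain SGD update of M; all other model weights stay frozen."""
    return M - lr * grad_wrt_M(g, h)


hidden_size = 8                            # hypothetical size for illustration
rng = np.random.default_rng(0)
h = rng.standard_normal(hidden_size)
M = np.zeros((hidden_size, hidden_size))   # assumed zero init, so h' == h at the start

h_prime = adapt_hidden(h, M)
g = h_prime                                # stand-in gradient, e.g. loss = 0.5 * ||h'||^2
M = sgd_step_on_M(M, g, h, lr=0.1)
print("adapted parameters:", M.size)
```

Because only M is updated, the number of adapted parameters scales with the square of the adapted hidden dimension rather than with the full model size, which is where the reduction in adaptation parameters comes from.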


This review was generated by AI and reviewed by human editors.