QUICK REVIEW

[Paper Review] Multiscale sequence modeling with a learned dictionary

Bart van Merriënboer, Amartya Sanyal|arXiv (Cornell University)|Jul 3, 2017

Natural Language Processing Techniques|5 citations

TL;DR

This paper proposes a multiscale sequence model that predicts multi-symbol tokens rather than single characters or words, drawing those tokens from a dictionary learned with a BPE-inspired algorithm. By combining the flexibility of character-level models with the efficiency of word-level models, the approach outperforms standard LSTMs on language modeling, especially for smaller models, while keeping the likelihood tractable via dynamic programming.

ABSTRACT

We propose a generalization of neural network sequence models. Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modelling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modeling tasks, especially for smaller models.

Motivation & Objective

To address the limitations of character-level and word-level sequence models by introducing a hybrid approach that combines their strengths.
To reduce training difficulty caused by long-term dependencies and softmax saturation in RNNs by modeling longer, meaningful subword units.
To maintain the flexibility of character-level models for handling OOV (out-of-vocabulary) words while improving performance through structured tokenization.
To enable efficient, tractable likelihood computation through dynamic programming over multiple possible segmentations.

Proposed method

The model uses a dictionary of multi-symbol tokens, learned via a BPE-like algorithm, to represent sequences at multiple scales.
At each time step, the model predicts over all valid tokens that match the current suffix of the sequence, allowing overlapping and hierarchical predictions.
Hidden states are computed as an average over the RNN outputs for all matching tokens, using the transition function f and token embeddings x_i.
The likelihood is computed using dynamic programming, marginalizing over all valid segmentations of the sequence.
The model uses an RNN (e.g., LSTM) to maintain context, with the hidden state h_t updated based on the dictionary tokens that match the most recent symbols.
The likelihood is optimized directly via gradient descent, similar to CTC and forward-backward algorithms, enabling end-to-end training.
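
The marginalization over segmentations described above can be sketched as a forward recursion. This is a minimal illustration, not the authors' implementation: the names `marginal_log_likelihood`, `token_logprob`, and `TOY_PROBS` are hypothetical, and the toy scorer uses fixed, context-independent token probabilities where the actual model would condition on the RNN hidden state at the position where a token starts.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginal_log_likelihood(sequence, dictionary, token_logprob):
    """Log-probability of `sequence`, summed over every split into dictionary tokens.

    alpha[t] is the log-probability of the first t symbols:
    alpha[t] = logsumexp over tokens s ending at t of
               alpha[t - len(s)] + token_logprob(t - len(s), s)
    """
    n = len(sequence)
    max_len = max(len(tok) for tok in dictionary)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0  # the empty prefix has probability 1
    for t in range(1, n + 1):
        terms = []
        for length in range(1, min(max_len, t) + 1):
            token = sequence[t - length:t]
            if token in dictionary and alpha[t - length] > -math.inf:
                terms.append(alpha[t - length] + token_logprob(t - length, token))
        if terms:
            alpha[t] = logsumexp(terms)
    return alpha[n]

# Assumption: a toy stand-in for the network's predictive distribution.
TOY_PROBS = {"a": 0.5, "b": 0.3, "ab": 0.2}

def toy_logprob(start, token):
    # `start` is ignored here; a real model would use the hidden state at `start`.
    return math.log(TOY_PROBS[token])

ll = marginal_log_likelihood("ab", set(TOY_PROBS), toy_logprob)
```

For the string "ab" the recursion adds the two-token path ("a", "b") and the one-token path ("ab"), so the cost grows with sequence length times maximum token length rather than with the number of segmentations.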

Experimental results

Research questions

RQ1: Can a sequence model that predicts multi-symbol tokens instead of single symbols achieve better performance than standard character- or word-level models?
RQ2: How does the use of a learned, BPE-inspired dictionary affect modeling efficiency and generalization, especially for rare or unseen words?
RQ3: Can the model maintain tractable likelihood computation while allowing multiple overlapping token predictions at each step?
RQ4: To what extent does the multiscale approach reduce training difficulties associated with long-term dependencies and softmax saturation?
RQ5: How does the model’s performance compare to state-of-the-art RNN variants like MI-LSTM and td-LSTM on standard language modeling benchmarks?

Key findings

The proposed multiscale model outperforms standard LSTM language models, particularly on smaller architectures, suggesting it makes better use of limited model capacity.
The model achieves better performance than character-level models by reducing the number of transitions needed to model sequences, thereby mitigating training difficulties.
The use of a BPE-inspired dictionary enables the model to handle OOV words effectively, maintaining the flexibility of character-level models.
The likelihood computation is tractable via dynamic programming, allowing direct optimization and enabling marginalization over all valid segmentations.
The model achieves competitive results on the text8 dataset, with performance approaching that of state-of-the-art models such as HM-LSTM, though not surpassing them.
The approach generalizes well to other architectures, with potential for deeper or more complex RNN variants to further improve performance.
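
The point about OOV handling can be illustrated with a small sketch. It uses plain BPE-style merging (repeatedly fusing the most frequent adjacent pair), which is an assumption standing in for the paper's own variation of BPE; the function names `learn_bpe_dictionary` and `count_segmentations` are hypothetical. Because single characters are never removed from the dictionary, any word built from seen characters still has at least one valid segmentation.

```python
from collections import Counter

def learn_bpe_dictionary(text, num_merges):
    """Plain BPE-style merging: repeatedly fuse the most frequent adjacent pair."""
    symbols = list(text)
    dictionary = set(symbols)  # single characters always stay available
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges add no value
        merged = left + right
        dictionary.add(merged)
        # Re-tokenize left to right, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return dictionary

def count_segmentations(word, dictionary):
    """Number of ways to split `word` into dictionary tokens (0 if impossible)."""
    ways = [1] + [0] * len(word)
    for t in range(1, len(word) + 1):
        for start in range(t):
            if word[start:t] in dictionary:
                ways[t] += ways[start]
    return ways[len(word)]

vocab = learn_bpe_dictionary("the cat and the hat and the bat", num_merges=5)
unseen_ok = count_segmentations("that", vocab)  # word absent from the training text
```

Here "that" never appears in the training text but remains segmentable, since its characters are in the dictionary; a word containing an entirely unseen character would have zero segmentations.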

This review was created by AI and reviewed by human editors.