QUICK REVIEW

[Paper Review] N-gram Language Modeling using Recurrent Neural Network Estimation

Ciprian Chelba, Mohammad Norouzi|arXiv (Cornell University)|Mar 31, 2017

Topic Modeling | References: 6 | Citations: 33

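One-Sentence Summary

This paper proposes smoothing n-gram language models with an LSTM-based neural network in place of conventional back-off methods such as Kneser-Ney. The LSTM captures long-range dependencies effectively: at about n=13 it matches a fully recurrent LSTM, and its perplexity keeps improving as the n-gram order increases. At long contexts (n > 5) it outperforms the classical smoothing methods.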

ABSTRACT

We investigate the effective memory depth of RNN models by using them for $n$-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and 10k vocabulary) have found the LSTM cell with dropout to be the best model for encoding the $n$-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption the LSTM $n$-gram matches the LSTM LM performance for $n=9$ and slightly outperforms it for $n=13$. When allowing dependencies across sentence boundaries, the LSTM $13$-gram almost matches the perplexity of the unlimited history LSTM LM. LSTM $n$-gram smoothing also has the desirable property of improving with increasing $n$-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot target is only slightly beneficial for low $n$-gram orders. Experiments on the One Billion Words benchmark show that the results hold at larger scale: while LSTM smoothing for short $n$-gram contexts does not provide significant advantages over classic N-gram models, it becomes effective with long contexts ($n > 5$); depending on the task and amount of data it can match fully recurrent LSTM models at about $n=13$. This may have implications when modeling short-format text, e.g. voice search/query LMs. Building LSTM $n$-gram LMs may be appealing for some practical situations: the state in a $n$-gram LM can be succinctly represented with $(n-1)*4$ bytes storing the identity of the words in the context and batches of $n$-gram contexts can be processed in parallel. On the downside, the $n$-gram context encoding computed by the LSTM is discarded, making the model more expensive than a regular recurrent LSTM LM.
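The (n-1)*4-byte state mentioned in the abstract can be illustrated with a short sketch. It assumes each word identity is a 32-bit unsigned integer id. The function names `pack_context` and `unpack_context` and the little-endian layout are illustrative choices, not taken from the paper.

```python
import struct


def pack_context(word_ids):
    """Pack the n-1 context word ids as little-endian unsigned 32-bit ints."""
    return struct.pack(f"<{len(word_ids)}I", *word_ids)


def unpack_context(state):
    """Recover the list of word ids from the packed state."""
    return list(struct.unpack(f"<{len(state) // 4}I", state))


n = 13                                      # n-gram order
context = list(range(100, 100 + n - 1))     # 12 hypothetical word ids
state = pack_context(context)
print(len(state))                           # (n - 1) * 4 = 48 bytes
```

Because the state holds only word ids, it is the same size regardless of the LSTM's hidden dimension; the price is that the LSTM must re-encode the context whenever a prediction is needed.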

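Motivation and Objectives

To investigate the effective memory depth of RNN models by using them for n-gram language model smoothing.
To evaluate whether LSTM-based models outperform conventional n-gram smoothing methods such as Katz and Kneser-Ney in terms of perplexity and scalability.
To examine the trade-offs among training efficiency, inference speed, and model quality when an LSTM encodes the n-gram context.
To verify whether neural smoothing, unlike classical back-off methods, keeps improving as the n-gram order increases.
To assess the practical suitability of LSTM n-gram models for short-format text such as voice search and query LMs.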

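Proposed Method

An LSTM network is trained to predict the next word from a fixed-length n-gram context, replacing conventional n-gram probability estimation.
The LSTM processes the embeddings of the words in the n-gram context one at a time, maintaining a hidden state that encodes the context history.
Dropout is applied to the LSTM cell to improve generalization and reduce overfitting.
As an alternative to the usual one-hot targets, multinomial distributions (soft labels) are also used as training targets.
At inference time, the state of an n-gram LM can be stored compactly in 4*(n-1) bytes holding the identities of the context words, and batches of n-gram contexts can be processed in parallel. The LSTM encoding of each context is discarded after use, which makes the model more expensive than a regular recurrent LSTM LM.
The experiments compare two settings: sentence independence (state reset at <S>) and contexts that extend across sentence boundaries.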
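The method above can be sketched as a tiny LSTM that encodes n-1 context word ids from a zero state and scores the next word. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the class name `TinyLSTMNgram`, the single layer, the random untrained weights, and the omission of dropout are all simplifications introduced here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMNgram:
    """Illustrative single-layer LSTM n-gram scorer (hypothetical, untrained)."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.dim = dim
        self.emb = mat(vocab_size, dim)                  # word embeddings
        self.W = {g: mat(dim, 2 * dim) for g in "ifog"}  # gate weights over [x; h]
        self.b = {g: [0.0] * dim for g in "ifog"}        # gate biases
        self.out = mat(vocab_size, dim)                  # softmax projection

    def _affine(self, gate, xh):
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(self.W[gate], self.b[gate])]

    def encode(self, context_ids):
        # The state starts from zero for every context: nothing is carried
        # over between n-grams, and the encoding is discarded after use.
        h = [0.0] * self.dim
        c = [0.0] * self.dim
        for wid in context_ids:
            xh = self.emb[wid] + h  # concatenate embedding and hidden state
            i = [sigmoid(v) for v in self._affine("i", xh)]
            f = [sigmoid(v) for v in self._affine("f", xh)]
            o = [sigmoid(v) for v in self._affine("o", xh)]
            g = [math.tanh(v) for v in self._affine("g", xh)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h

    def next_word_probs(self, context_ids):
        h = self.encode(context_ids)
        logits = [sum(w * v for w, v in zip(row, h)) for row in self.out]
        top = max(logits)                       # subtract max for stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


model = TinyLSTMNgram(vocab_size=10, dim=4)
probs = model.next_word_probs([1, 5, 2, 7])  # a 5-gram: four context word ids
```

Each call runs n-1 LSTM steps for a single prediction, whereas a recurrent LSTM LM reuses its running state and needs one step per word. That is the extra cost noted in the abstract, offset by the fact that independent contexts can be batched and processed in parallel.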

Experimental Results

Research Questions

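RQ1: What is the effective memory depth of an LSTM used for n-gram language model smoothing?
RQ2: How does LSTM-based n-gram smoothing compare in performance with classical methods such as Kneser-Ney and Katz smoothing?
RQ3: Unlike conventional smoothing methods, does the LSTM-smoothed n-gram model improve as the n-gram order increases?
RQ4: Can the LSTM n-gram model approach the performance of a fully recurrent LSTM language model, and at which n-gram order?
RQ5: What are the training and inference efficiency trade-offs of an LSTM-based n-gram model compared with a standard recurrent LSTM?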

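Key Findings

The LSTM with dropout encoded the n-gram state better than the feed-forward and vanilla RNN models, achieving the lowest perplexity on UPenn Treebank.
Under the sentence independence assumption, the LSTM n-gram matched the fully recurrent LSTM LM at n=9 and slightly outperformed it at n=13.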
When dependencies across sentence boundaries are allowed, the LSTM 13-gram almost matched the unlimited-history LSTM LM (reported perplexity of 49 versus 48).
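The LSTM n-gram model improved consistently as the n-gram order increased, unlike the Kneser-Ney and Katz back-off estimators, which stop improving at low orders.
On the One Billion Words benchmark, LSTM smoothing became effective for n > 5 and, depending on the task and the amount of data, matched the fully recurrent LSTM at about n=13.
Using multinomial targets instead of one-hot labels was only slightly beneficial, and only for low n-gram orders.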
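The difference between one-hot and multinomial targets in the last finding can be made concrete with a small sketch. It assumes that the multinomial target for a context is the empirical distribution of the words that follow that context in the training data. The helper names `multinomial_targets` and `cross_entropy` and the toy data are hypothetical and only for illustration.

```python
import math
from collections import Counter, defaultdict


def multinomial_targets(examples):
    """Map each context to the empirical distribution of its next words."""
    counts = defaultdict(Counter)
    for context, word in examples:
        counts[context][word] += 1
    targets = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        targets[context] = {w: c / total for w, c in counter.items()}
    return targets


def cross_entropy(target, predicted):
    """Cross-entropy of a predicted distribution against a target distribution."""
    return -sum(p * math.log(predicted[w]) for w, p in target.items())


# Toy bigram data: (context, next word) pairs.
examples = [(("the",), "cat"), (("the",), "cat"), (("the",), "dog"), (("a",), "dog")]
targets = multinomial_targets(examples)

predicted = {"cat": 0.6, "dog": 0.4}                         # hypothetical model output
soft_loss = cross_entropy(targets[("the",)], predicted)      # one multinomial target
one_hot_loss = cross_entropy({"cat": 1.0}, predicted)        # one observed word
```

With one-hot targets, each occurrence of a context contributes a separate training example. With multinomial targets, all observed continuations of a context are folded into a single soft label. At high n-gram orders most contexts are seen only once, so the two kinds of target largely coincide, which is consistent with the benefit being limited to low orders.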

This review was generated by AI and checked by a human editor.