QUICK REVIEW

[Paper Review] Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings

Rie Johnson, Tong Zhang|arXiv (Cornell University)|Feb 7, 2016

Text and Document Classification Technologies|References: 23|Citations: 136

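One-sentence summary

This paper proposes a general region-embedding framework for text categorization, embeds text regions with a one-hot LSTM, and shows that combining LSTM-based and CNN-based region embeddings trained on unlabeled data achieves state-of-the-art results on several benchmarks.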

ABSTRACT

One-hot CNN (convolutional neural network) has been shown to be effective for text categorization (Johnson & Zhang, 2015). We view it as a special case of a general framework which jointly trains a linear model with a non-linear feature generator consisting of 'text region embedding + pooling'. Under this framework, we explore a more sophisticated region embedding method using Long Short-Term Memory (LSTM). LSTM can embed text regions of variable (and possibly large) sizes, whereas the region size needs to be fixed in a CNN. We seek effective and efficient use of LSTM for this purpose in the supervised and semi-supervised settings. The best results were obtained by combining region embeddings in the form of LSTM and convolution layers trained on unlabeled data. The results indicate that on this task, embeddings of text regions, which can convey complex concepts, are more useful than embeddings of single words in isolation. We report performances exceeding the previous best results on four benchmark datasets.
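The framework the abstract describes (a non-linear 'text region embedding + pooling' feature generator feeding a linear model) can be illustrated with a minimal NumPy sketch of the fixed-size-region (CNN-style) case. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the ReLU non-linearity, max pooling, the random weights, and the toy sizes are all hypothetical choices made for the example.

```python
import numpy as np

def one_hot(token_ids, vocab_size):
    """Return a (len(token_ids), vocab_size) matrix with a single 1 per row."""
    x = np.zeros((len(token_ids), vocab_size))
    x[np.arange(len(token_ids)), token_ids] = 1.0
    return x

def cnn_region_embeddings(token_ids, W, b, region_size):
    """Embed every fixed-size region of the document.

    Each region vector is the concatenation of the one-hot vectors of its
    words; a ReLU layer (assumed here) maps it to a dense embedding.
    """
    vocab_size = W.shape[0] // region_size
    x = one_hot(token_ids, vocab_size)
    n_regions = len(token_ids) - region_size + 1
    regions = np.stack([x[i:i + region_size].ravel() for i in range(n_regions)])
    return np.maximum(0.0, regions @ W + b)  # shape: (n_regions, dim)

def document_score(token_ids, W, b, w_out, region_size):
    """Max-pool the region embeddings, then apply a linear model."""
    emb = cnn_region_embeddings(token_ids, W, b, region_size)
    doc_vector = emb.max(axis=0)  # pooling over all regions
    return float(doc_vector @ w_out)

# Toy usage with hypothetical sizes and random (untrained) weights.
rng = np.random.default_rng(0)
vocab_size, region_size, dim = 10, 3, 4
W = rng.normal(size=(vocab_size * region_size, dim))
b = np.zeros(dim)
w_out = rng.normal(size=dim)
doc = [1, 4, 2, 7, 7, 3]
emb = cnn_region_embeddings(doc, W, b, region_size)
score = document_score(doc, W, b, w_out, region_size)
```

Note that `region_size` is fixed in advance here, which is the limitation the abstract contrasts with LSTM-based region embeddings.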

Motivation and objectives

Introduce a general framework of region embedding + pooling for text categorization that subsumes one-hot CNN.
Investigate using Long Short-Term Memory (LSTM) as a region-embedding generator without word embeddings.
Assess supervised and semi-supervised settings, including unlabeled data to learn region embeddings.
Demonstrate whether combining LSTM-based embeddings with CNN-based embeddings trained on unlabeled data improves performance.

Proposed method

Eliminate the word embedding layer and feed one-hot vectors directly into the LSTM (one-hot LSTM).
Pool the region embeddings to form the document representation, so that the LSTM embeds text regions (short segments) rather than whole documents.
Simplify the LSTM by removing the input and output gates, chop documents into segments for speed, and optionally use a bidirectional LSTM with pooling.
Introduce LSTM tv-embeddings learned from unlabeled data to provide additional inputs to the supervised LSTM model.
Combine LSTM tv-embeddings with CNN tv-embeddings to create complementary region representations.
Train end-to-end with SGD (or RMSProp) on labeled data; evaluate on four benchmarks and compare against SVM, oh-CNN, and wv-LSTM.
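The first three steps above can be sketched in NumPy as follows. This is a hedged illustration, not the paper's code: it assumes that removing the input and output gates leaves the updates c_t = u_t + f_t * c_{t-1} and h_t = tanh(c_t), that chopped segments are processed independently, and that max pooling is used. The helper names (`init_params`, `simplified_lstm`, `chop`, `document_vector`) and all sizes are hypothetical, and no training is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(vocab_size, dim, rng, scale=0.1):
    """Hypothetical random initialisation: (Wf, Uf, bf, Wu, Uu, bu)."""
    Wf = rng.normal(scale=scale, size=(dim, vocab_size))
    Uf = rng.normal(scale=scale, size=(dim, dim))
    Wu = rng.normal(scale=scale, size=(dim, vocab_size))
    Uu = rng.normal(scale=scale, size=(dim, dim))
    return Wf, Uf, np.zeros(dim), Wu, Uu, np.zeros(dim)

def simplified_lstm(token_ids, params):
    """LSTM without input/output gates, run over one-hot inputs.

    Because x_t is one-hot, W @ x_t is a column lookup W[:, t], so no
    separate word embedding layer is needed.
    """
    Wf, Uf, bf, Wu, Uu, bu = params
    h = np.zeros(bf.shape[0])
    c = np.zeros(bf.shape[0])
    outputs = []
    for t in token_ids:
        f = sigmoid(Wf[:, t] + Uf @ h + bf)  # forget gate
        u = np.tanh(Wu[:, t] + Uu @ h + bu)  # candidate update
        c = u + f * c                        # no input gate on u
        h = np.tanh(c)                       # no output gate on h
        outputs.append(h)
    return np.stack(outputs)  # one region embedding per position

def chop(token_ids, segment_len):
    """Split a document into consecutive segments of at most segment_len."""
    return [token_ids[i:i + segment_len]
            for i in range(0, len(token_ids), segment_len)]

def document_vector(token_ids, fwd_params, bwd_params, segment_len):
    """Bidirectional pass over each segment, then max pooling per direction."""
    segments = chop(token_ids, segment_len)
    fwd = np.concatenate([simplified_lstm(s, fwd_params) for s in segments])
    bwd = np.concatenate([simplified_lstm(s[::-1], bwd_params) for s in segments])
    return np.concatenate([fwd.max(axis=0), bwd.max(axis=0)])

# Toy usage with hypothetical sizes and untrained weights.
rng = np.random.default_rng(0)
vocab_size, dim = 12, 5
fwd_params = init_params(vocab_size, dim, rng)
bwd_params = init_params(vocab_size, dim, rng)
doc = [3, 1, 4, 1, 5, 9, 2, 6]
vec = document_vector(doc, fwd_params, bwd_params, segment_len=3)
```

The resulting `vec` would be the input to the linear classifier; because segments are independent in this sketch, they could be processed in parallel, which is the speed motivation for chopping.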

Experimental results

Research questions

RQ1: Can region embeddings learned by one-hot LSTM outperform fixed-size region embeddings in CNNs for text categorization?
RQ2: Does incorporating unlabeled data via tv-embeddings improve supervised text classification?
RQ3: Do combinations of LSTM and CNN region embeddings yield complementary benefits and better performance than either alone?

Key findings

Error rates (%) in the supervised setting (Table 3 of the paper):
Method	IMDB	Elec	RCV1	20NG
SVM bow	11.36	11.71	10.76	17.47
SVM 1–3grams	9.42	8.71	10.69	15.85
wv-LSTM [DL15]	13.50	11.74	16.04	18.0
oh-2LSTMp	8.14	7.33	11.17	13.32
oh-CNN [JZ15b]	8.39	7.64	9.17	13.64

One-hot bidirectional LSTM with pooling (oh-2LSTMp) outperforms word-vector LSTM (wv-LSTM) on all four datasets. It also beats oh-CNN on IMDB, Elec, and 20NG, but trails it on RCV1 (11.17 vs 9.17).
In the supervised setting, oh-2LSTMp achieves lower error than both SVM baselines on IMDB, Elec, and 20NG, with Table 3 showing: IMDB 8.14, Elec 7.33, RCV1 11.17, 20NG 13.32.
Semi-supervised results show that oh-2LSTMp with LSTM tv-embeddings trained on unlabeled data improves over the supervised version on the datasets evaluated (e.g., IMDB 6.66 vs 8.14).
Combining LSTM tv-embeddings with CNN tv-embeddings improves results further (Table 6).
The best reported results, obtained with region embeddings trained on unlabeled data, surpass the previous best results, e.g., IMDB 5.94, Elec 5.55, RCV1 7.15 in Table 7.
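As a quick sanity check on the comparisons above, the supervised error rates from the table can be loaded into a small Python script. The numbers are copied from the table in this review; the variable and function names are hypothetical and used only for illustration.

```python
# Supervised error rates (%) copied from the table above.
errors = {
    "SVM bow":      {"IMDB": 11.36, "Elec": 11.71, "RCV1": 10.76, "20NG": 17.47},
    "SVM 1-3grams": {"IMDB": 9.42,  "Elec": 8.71,  "RCV1": 10.69, "20NG": 15.85},
    "wv-LSTM":      {"IMDB": 13.50, "Elec": 11.74, "RCV1": 16.04, "20NG": 18.0},
    "oh-2LSTMp":    {"IMDB": 8.14,  "Elec": 7.33,  "RCV1": 11.17, "20NG": 13.32},
    "oh-CNN":       {"IMDB": 8.39,  "Elec": 7.64,  "RCV1": 9.17,  "20NG": 13.64},
}
datasets = ["IMDB", "Elec", "RCV1", "20NG"]

def best_method(dataset):
    """Return the method with the lowest error rate on a dataset."""
    return min(errors, key=lambda method: errors[method][dataset])

def relative_reduction(method, baseline, dataset):
    """Relative error reduction of `method` over `baseline` (fraction)."""
    base = errors[baseline][dataset]
    return (base - errors[method][dataset]) / base

best = {d: best_method(d) for d in datasets}
gain_over_wv = {d: relative_reduction("oh-2LSTMp", "wv-LSTM", d) for d in datasets}

for d in datasets:
    print(f"{d}: best = {best[d]}, oh-2LSTMp vs wv-LSTM = {gain_over_wv[d]:.1%}")
```

On these numbers, oh-2LSTMp has the lowest error on IMDB, Elec, and 20NG, while oh-CNN is lowest on RCV1.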

This review was generated by AI and checked by a human editor.