QUICK REVIEW

[Paper Review] Enriching Word Vectors with Subword Information

Piotr Bojanowski, Édouard Grave|arXiv (Cornell University)|Jul 15, 2016

Topic Modeling | References: 35 | Citations: 438

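One-line summary

Introduces a subword-aware word embedding model that represents each word as the sum of hashed character n-gram vectors, which makes it possible to represent out-of-vocabulary words and improves performance on morphologically rich languages.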

ABSTRACT

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character $n$-grams. A vector representation is associated to each character $n$-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

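Motivation and Objectives

Address the limitation that conventional word embeddings ignore word morphology.
Use subword information to share parameters across words and better represent rare and unseen words.
Evaluate across languages and tasks to show the benefit for morphologically rich languages.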

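Proposed Method

Extends skip-gram with negative sampling to incorporate subword information.
Represents each word as a bag of character n-grams with boundary symbols, and sums their vectors to form the word representation.
Assigns a vector to each n-gram and trains the vectors with SGD using negative sampling.
Uses hashing to map n-grams onto a fixed-size set of vectors, which bounds memory use.
Trains and evaluates on large Wikipedia corpora in nine languages; out-of-vocabulary (OOV) words are handled by summing their n-gram vectors.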
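
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the n-gram range (3 to 6), the toy bucket count, the toy dimension, and the choice of 32-bit FNV-1a as the hash are all assumptions made for the example, and the vector table is randomly initialised rather than trained. The sketch keeps only the n-gram part of the representation, which is what remains available for a word never seen in training.

```python
import random

MIN_N, MAX_N = 3, 6   # assumed n-gram range; not stated in this review
NUM_BUCKETS = 1000    # assumed toy bucket count, far smaller than a real model
DIM = 8               # assumed toy embedding dimension


def char_ngrams(word, min_n=MIN_N, max_n=MAX_N):
    """Return the character n-grams of `word` wrapped in boundary symbols < and >."""
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text` (deterministic across runs)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def bucket(gram, num_buckets=NUM_BUCKETS):
    """Map an n-gram to a row index in the fixed-size vector table."""
    return fnv1a_32(gram) % num_buckets


rng = random.Random(0)
# Randomly initialised table standing in for trained n-gram vectors.
table = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def word_vector(word):
    """Sum the bucket vectors of all n-grams of `word`; also works for unseen words."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        row = table[bucket(gram)]
        vec = [a + b for a, b in zip(vec, row)]
    return vec


def score(word, context_vec):
    """Skip-gram style score: dot product of the summed n-gram vector and a context vector."""
    return sum(a * b for a, b in zip(word_vector(word), context_vec))


print(char_ngrams("where", 3, 3))      # the 3-grams of "<where>"
print(len(word_vector("unseenword")))  # a vector exists even for an unseen word
```

Because distinct n-grams can land in the same bucket, hashing trades some collisions for a table whose size is fixed in advance. In training, `score` would be fed into the negative-sampling loss, and the gradient would flow back to every bucket row that contributed to the sum.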

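Experimental Results

Research Questions

RQ1: Does incorporating character n-gram subword information improve word similarity and analogy performance across languages?
RQ2: How do subword-based word representations compare with morphology-aware baselines and earlier subword methods?
RQ3: Can OOV words be represented effectively as the sum of their n-gram vectors, and how does this affect downstream tasks?
RQ4: How do the amount of training data and the n-gram range affect performance, particularly for morphologically rich languages?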

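Key Findings

The subword-enriched vectors (sisg) outperform the baselines on most word similarity datasets and improve the handling of OOV words.
The approach improves performance on syntactic analogies, with notable gains for morphologically rich languages such as German and Czech.
Compared with morphology-based methods, the simple sum-of-n-grams representation is competitive and often better, particularly for languages with compound words or rich inflection.
Performance remains robust when training data is limited, a practical advantage in low-resource settings.
Extending the n-gram range to include longer sequences can improve semantic analogy performance, though the trade-offs differ across languages.
In language modeling experiments, initializing with subword-aware vectors lowers perplexity, especially for Slavic languages.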
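
This review does not spell out how the word similarity comparison is scored. A common protocol, assumed here, is to compute the cosine similarity of each word pair under the model and report the Spearman rank correlation with human ratings. The sketch below implements that protocol in plain Python; the vectors and ratings in the usage lines are made-up toy values, not data from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranks(values):
    """1-based ranks of `values`; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy usage with made-up vectors and made-up human ratings (illustration only).
pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
human = [9.0, 5.0, 1.0]
model = [cosine(u, v) for u, v in pairs]
print(spearman(model, human))
```

Rank correlation only asks whether the model orders the pairs the way annotators do, so it is insensitive to the scale of the cosine values.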

This review was generated by AI and checked by a human editor.