QUICK REVIEW

[Paper Review] Keyword Assisted Topic Models

Shusei Eshima, Kosuke Imai|arXiv (Cornell University)|Apr 13, 2020

Computational and Text Analysis Methods|References: 52|Cited by: 33

One-Line Summary

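The paper introduces keyATM, a semi-supervised topic model that uses a small number of keywords to improve interpretability and measurement, and that supports no-keyword topics, covariates, and time trends.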

ABSTRACT

In recent years, fully automated content analysis based on probabilistic topic models has become popular among social scientists because of their scalability. The unsupervised nature of the models makes them suitable for exploring topics in a corpus without prior knowledge. However, researchers find that these models often fail to measure specific concepts of substantive interest by inadvertently creating multiple topics with similar content and combining distinct themes into a single topic. In this paper, we empirically demonstrate that providing a small number of keywords can substantially enhance the measurement performance of topic models. An important advantage of the proposed keyword assisted topic model (keyATM) is that the specification of keywords requires researchers to label topics prior to fitting a model to the data. This contrasts with a widespread practice of post-hoc topic interpretation and adjustments that compromises the objectivity of empirical findings. In our application, we find that keyATM provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than the standard topic models. Finally, we show that keyATM can also incorporate covariates and model time trends. An open-source software package is available for implementing the proposed methodology.

Motivation and Objectives

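Motivates the need for better measurement in automated content analysis based on topic models.
Proposes a semi-supervised topic model (keyATM) that incorporates a small number of keywords for each topic.
Extends the base model to allow topics without keywords and to model document covariates and time trends.
Shows that incorporating keywords yields more interpretable topics and better classification performance than unsupervised baselines.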

Proposed Method

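Defines two types of topics within a K-topic model: keyword topics and no-keyword topics.
For keyword topics, introduces a Bernoulli variable s_di that determines whether a word is drawn from the keyword distribution or from the standard topic-word distribution.
Places Dirichlet priors on the topic-word and keyword-word distributions, and a Beta prior on the keyword probability parameter pi_k.
Adopts a collapsed Gibbs sampler that marginalizes out θ, φ, tilde_φ, and pi and samples z_di, s_di, and alpha_k.
Introduces word weighting (wLDA) to down-weight overly frequent words in the counts used during sampling.
Provides closed-form expressions for the posterior estimates of φ*_kv and θ_dk, and discusses the interpretation of the keyword and non-keyword components.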
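
The generative story described in the list above can be sketched as a short simulation. This is a minimal illustration rather than the authors' implementation: the function name `simulate_keyatm`, the input format for `keywords`, and all hyperparameter values are assumptions chosen for readability, and neither the Gibbs sampler nor the word weighting is shown.

```python
import numpy as np

def simulate_keyatm(keywords, n_topics, vocab_size, n_docs=5, doc_len=50, seed=0):
    """Simulate documents from a keyword-assisted topic model (illustrative sketch).

    keywords: dict mapping a topic index to a list of keyword word ids
              (hypothetical input format). Topics absent from the dict are
              treated as no-keyword topics.
    Returns a list of documents, each a list of (word, topic, switch) tuples.
    """
    rng = np.random.default_rng(seed)
    # Hyperparameter values are assumptions, not taken from the paper.
    alpha, beta, beta_kw = 0.5, 0.5, 0.5
    gamma1, gamma2 = 1.0, 1.0

    # Standard topic-word distributions over the full vocabulary (Dirichlet prior).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # Keyword-word distributions: nonzero only on each topic's own keywords.
    phi_kw = np.zeros((n_topics, vocab_size))
    for k, kw in keywords.items():
        phi_kw[k, kw] = rng.dirichlet(np.full(len(kw), beta_kw))

    # Probability of drawing from the keyword distribution (Beta prior).
    pi = rng.beta(gamma1, gamma2, size=n_topics)

    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))  # document-topic proportions
        tokens = []
        for _ in range(doc_len):
            z = int(rng.choice(n_topics, p=theta))          # topic assignment z_di
            # Bernoulli switch s_di; always 0 for no-keyword topics.
            s = int(z in keywords and rng.random() < pi[z])
            dist = phi_kw[z] if s else phi[z]
            w = int(rng.choice(vocab_size, p=dist))         # observed word
            tokens.append((w, z, s))
        docs.append(tokens)
    return docs

# Two keyword topics (0 and 1) and one no-keyword topic (2).
docs = simulate_keyatm({0: [0, 1, 2], 1: [3, 4]}, n_topics=3, vocab_size=10)
```

In this sketch a keyword can still be generated with s_di = 0, because the standard topic-word distribution covers the whole vocabulary. The switch only decides which of the two distributions a word is drawn from.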

Experimental Results

Research Questions

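RQ1: Does incorporating a small number of keywords per topic improve topic interpretability compared with unsupervised topic models?
RQ2: Does keyATM improve document classification performance compared with standard LDA-based models?
RQ3: Can the model allow no-keyword topics and incorporate covariates and time trends without degrading performance?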

Key Findings

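keyATM produces more interpretable topic-word distributions than the keyword-unaware baseline (wLDA).
keyATM's topic-word distributions align better with human-coded labels and the CAP/CBP classification.
On a corpus of congressional bills, keyATM improves document-topic classification over wLDA for most topics and outperforms it in the ROC comparison.
Allowing no-keyword topics and learning the hyperparameters improves the model's flexibility and performance.
The base keyATM can incorporate covariates and model time trends while retaining its gains in interpretability and measurement quality.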
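
One way to read the ROC comparison above is to treat the estimated topic proportion θ_dk as a score for whether document d carries the human-coded label for topic k, and then summarize the ROC curve by its area (AUC). The sketch below assumes that setup. The helper `roc_auc` and the toy numbers are illustrative only and do not reproduce the paper's evaluation or its data.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve, computed from pairwise comparisons.

    scores: per-document scores, e.g. theta_dk for a fixed topic k.
    labels: 1 if the document has the human-coded label for topic k, else 0.
    Ties between a positive and a negative document count as half a win.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative document")
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))

# Toy document-topic proportions (rows: documents, columns: topics); made-up values.
theta = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])
human_topic = np.array([0, 0, 1, 2])  # hypothetical human-coded topic per document

# One AUC per topic: how well theta[:, k] separates documents labeled k from the rest.
auc_per_topic = [roc_auc(theta[:, k], human_topic == k) for k in range(theta.shape[1])]
```

The pairwise form costs O(n_pos × n_neg) operations, which is fine for a sketch. A rank-based implementation would be preferable for a large corpus.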

This review was generated by AI and checked by a human editor.