QUICK REVIEW

[Paper Review] textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

Pankaj Gupta, Yatin Chaudhary|arXiv (Cornell University)|Oct 9, 2018

Topic Modeling|References: 33|Cited by: 2

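One-Sentence Summary

This paper proposes ctx-DocNADE and ctx-DocNADEe, two neural autoregressive topic models that integrate an LSTM-based language model and word embeddings to capture word order, syntax, semantics, and long-range dependencies, overcoming the bag-of-words limitation of traditional topic models. The models show marked improvements in perplexity, topic coherence, and retrieval and classification performance, particularly on short-text or sparse datasets.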

ABSTRACT

We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context): (1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-word" and consequently the semantics of words in the context is lost. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation from the entire document and discovers the underlying thematic structure. We unite two complementary paradigms of learning the meaning of word occurrences by combining a TM (e.g., DocNADE) and a LM in a unified probabilistic framework, named as ctx-DocNADE. (2) Limited Context and/or Smaller training corpus of documents: In settings with a small number of word occurrences (i.e., lack of context) in short text or data sparsity in a corpus of few documents, the application of TMs is challenging. We address this challenge by incorporating external knowledge into neural autoregressive topic models via a language modelling approach: we use word embeddings as input of a LSTM-LM with the aim to improve the word-topic mapping on a smaller and/or short-text corpus. The proposed DocNADE extension is named as ctx-DocNADEe. We present novel neural autoregressive topic model variants coupled with neural LMs and embeddings priors that consistently outperform state-of-the-art generative TMs in terms of generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) over 6 long-text and 8 short-text datasets from diverse domains.

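Research Motivation and Objectives

Address the tendency of traditional topic models to ignore word order and language structure by introducing neural language modeling.
Improve topic model performance in sparse or short-text settings by incorporating pretrained word embeddings as an external knowledge prior.
Unify neural autoregressive topic modeling and contextual language modeling in a single probabilistic framework to strengthen semantic representation.
Evaluate the proposed models on a range of long-text and short-text datasets using retrieval, classification, and coherence metrics.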

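Proposed Method

Combine the DocNADE-based neural autoregressive topic model with an LSTM-based language model (LSTM-LM) to jointly model local collocation patterns and global document-level semantics.
Use the hidden states of the LSTM-LM to condition the topic model's word probability estimates, enabling context-aware word generation.
Use pretrained word embeddings as an input prior to improve the word-topic mapping in low-resource settings.
Extend the framework to ctx-DocNADEe by using word embeddings as a compositional prior, improving generalization on sparse or short-text corpora.
Adopt a unified probabilistic framework that jointly optimizes topic assignment and language modeling via maximum likelihood estimation.
Exploit the layered representations of the LSTM-LM, in which lower layers capture syntactic features and higher layers capture semantic information, to enrich topic modeling.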
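
The combination described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a plain tanh recurrent cell stands in for the LSTM-LM, and the function name, the recurrent matrix `R`, the mixing weight `lam`, and all shapes are assumptions made for illustration. The sketch only shows the general idea that each word's probability is conditioned on an order-free sum over preceding words plus an order-aware recurrent state, and that pretrained embeddings `E` enter through the language-model input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

def ctx_docnade_log_likelihood(doc, W, U, b, c, R, lam=0.5, E=None):
    """Log-likelihood of one document under a simplified ctx-DocNADE-style model.

    doc  : sequence of word ids
    W    : (H, V) topic-model word matrix
    U    : (V, H) output projection
    b, c : output bias (V,) and hidden bias (H,)
    R    : (H, H) recurrent matrix of the stand-in language model (assumption)
    lam  : hypothetical weight on the language-model state
    E    : optional (H, V) fixed pretrained embeddings (the "e" variant)
    """
    H = c.shape[0]
    bow = np.zeros(H)  # order-free sum of W[:, v_k] over preceding words
    s = np.zeros(H)    # order-aware recurrent state over preceding words
    log_p = 0.0
    for v in doc:
        # The hidden state mixes the topic context with the LM context.
        h = sigmoid(c + bow + lam * s)
        p = softmax(b + U @ h)
        log_p += np.log(p[v])
        # Update both contexts only after predicting the current word.
        x = W[:, v] if E is None else W[:, v] + E[:, v]
        s = np.tanh(R @ s + x)
        bow += W[:, v]
    return log_p

rng = np.random.default_rng(0)
V, H = 6, 4
W, E = rng.normal(size=(H, V)), rng.normal(size=(H, V))
U, R = rng.normal(size=(V, H)), rng.normal(size=(H, H))
b, c = np.zeros(V), np.zeros(H)
doc = [1, 3, 2, 5]

ll_plain = ctx_docnade_log_likelihood(doc, W, U, b, c, R, lam=0.5)
ll_embed = ctx_docnade_log_likelihood(doc, W, U, b, c, R, lam=0.5, E=E)
ll_topic_only = ctx_docnade_log_likelihood(doc, W, U, b, c, R, lam=0.0)
```

Setting `lam=0.0` removes the recurrent term and leaves only the bag-of-words context, which makes the contribution of the language-model component easy to isolate in this toy setting.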

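Experimental Results

Research Questions

RQ1: Does integrating a neural language model into a topic model improve the estimate of P(word|context) by capturing word order and semantic structure?
RQ2: Does introducing word embeddings as a prior improve topic model performance in low-resource or short-text settings?
RQ3: Does combining a topic model with a contextual language model generalize better than current state-of-the-art models, as measured by perplexity?
RQ4: To what extent do the proposed models improve topic interpretability and performance on downstream NLP tasks such as retrieval and classification?
RQ5: Do the models retain their performance advantage when only a small amount of training data is used?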

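Key Findings

On the TMNtitle dataset, ctx-DocNADEe reaches an IR precision of 0.580 with 20% of the training data, compared with 0.444 for DocNADE.
On the same dataset, ctx-DocNADEe achieves a macro-F1 score of 0.711 with 20% of the training data, compared with 0.615 for DocNADE.
With 100% of the training data, ctx-DocNADEe reaches an IR precision of 0.595 and a macro-F1 score of 0.726, exceeding DocNADE's 0.525 and 0.688 respectively.
The model improves topic coherence and interpretability: the topics ctx-DocNADEe extracts on the 20NS dataset are more coherent than those of DocNADE.
In the text retrieval task, ctx-DocNADEe retrieves relevant documents that share no unigram overlap with the query, demonstrating strong semantic generalization.
Across 6 long-text and 8 short-text datasets, the proposed models consistently outperform state-of-the-art generative topic models on perplexity, coherence, retrieval, and classification.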
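
The review does not define how IR precision is computed. A common convention for this family of topic models, assumed here rather than taken from the text, is to treat each test document as a query, retrieve a fixed fraction of the closest training documents by cosine similarity of their document vectors, and report the share of retrieved documents that carry the query's label. The function name and the default `fraction` value below are hypothetical.

```python
import numpy as np

def precision_at_fraction(query_vecs, query_labels, train_vecs, train_labels, fraction=0.02):
    """Mean share of retrieved training documents whose label matches the query's label."""
    # Normalize rows so that a dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = q @ t.T
    # Number of documents to retrieve per query (at least one).
    k = max(1, int(round(fraction * len(train_vecs))))
    top = np.argsort(-sims, axis=1)[:, :k]
    hits = train_labels[top] == query_labels[:, None]
    return hits.mean()

# Toy data: two clusters of 2-D document vectors with labels 0 and 1.
train_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
query_vecs = np.array([[1.0, 0.05], [0.05, 1.0]])

perfect = precision_at_fraction(query_vecs, np.array([0, 1]), train_vecs, train_labels, fraction=0.5)
half = precision_at_fraction(query_vecs, np.array([0, 0]), train_vecs, train_labels, fraction=0.5)
```

In the toy data each query retrieves its two nearest training vectors, so precision is 1.0 when the query labels match their clusters and 0.5 when the second query is deliberately mislabeled.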


This review was generated by AI and reviewed by human editors.