QUICK REVIEW

[Paper Review] Document Informed Neural Autoregressive Topic Models

Pankaj Gupta, Florian Buettner|arXiv (Cornell University)|Jan 1, 2018

Topic Modeling | 6 references | 2 citations

TL;DR

This paper proposes iDocNADE, a neural autoregressive topic model that improves document and word representations by using the full context of each word (both preceding and following words) in a bidirectional language modeling framework. With separate forward and backward hidden layers for left and right context, iDocNADE improves document perplexity, topic coherence, and downstream retrieval and classification. On average across six datasets, it outperforms DocNADE by a relative 9.6% in precision at retrieval fraction 0.02 and a relative 7.2% in F1 for text categorization.

ABSTRACT

Context information around words helps in determining their actual meaning, for example "networks" used in contexts of artificial neural networks or biological neuron networks. Generative topic models infer topic-word distributions, taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. This results in an improved performance in terms of generalization, interpretability and applicability. We apply our modeling approach to seven data sets from various domains and demonstrate that our approach consistently outperforms state-of-the-art generative topic models. With the learned representations, we show on average a gain of 9.6% (0.57 vs. 0.52) in precision at retrieval fraction 0.02 and 7.2% (0.582 vs. 0.543) in F1 for text categorization.

Motivation & Objective

To address the limitation of existing topic models like DocNADE, which only use left (past) context, by incorporating both left and right (future) context for better word and document representation.
To improve generalization, interpretability, and applicability of neural topic models in downstream NLP tasks such as document retrieval and classification.
To learn more semantically meaningful word and topic representations by modeling the full context around each word in a document.
To demonstrate that bidirectional context modeling leads to superior performance compared to unidirectional models like DocNADE across diverse text domains.

Proposed method

iDocNADE extends DocNADE with two parallel hidden layers: a forward layer that processes the words preceding each position (left context) and a backward layer that processes the words following it (right context), so that together they cover the full sequence around each word.
For each word $v_i$, the model computes conditional probabilities $p(v_i \mid \mathbf{v}_{<i})$ and $p(v_i \mid \mathbf{v}_{>i})$ using separate feed-forward networks with shared parameters across words, enabling joint modeling of left and right context.
The model uses hierarchical softmax via a binary word tree to efficiently compute the conditional probability distribution over the vocabulary, reducing computational complexity.
Word representations are derived from the column vectors $\mathbf{W}_{:,v_i}$ of the input-to-hidden weight matrix $\mathbf{W}$, providing dense, context-informed embeddings.
The model is trained end-to-end via backpropagation to maximize the log-likelihood of the observed word sequences, optimizing both left and right context modeling.
A bidirectional architecture allows the model to capture long-range dependencies and disambiguate polysemous words (e.g., 'networks' in neuroscience vs. computer science) using full context.
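The forward and backward computation described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a sigmoid activation, replaces the hierarchical softmax with a plain full softmax for readability, shares `W` and `U` between the two directions while keeping separate biases, and averages the two log-likelihoods with a factor of 0.5. The function and variable names (`idocnade_log_likelihood`, `c_fwd`, `b_bwd`, and so on) are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_softmax(z):
    # Numerically stable log-softmax over a 1-D score vector.
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def idocnade_log_likelihood(doc, W, U, c_fwd, c_bwd, b_fwd, b_bwd):
    """Bidirectional autoregressive log-likelihood of one document (sketch).

    doc: 1-D integer array of word indices, length D
    W:   (H, V) input-to-hidden matrix; column W[:, w] is the embedding of word w
    U:   (V, H) hidden-to-output matrix (assumed shared by both directions)
    """
    emb = W[:, doc]                            # (H, D) embeddings in document order
    inclusive = np.cumsum(emb, axis=1)         # sums over positions k <= i
    left_sum = inclusive - emb                 # sums over k < i (left context)
    right_sum = inclusive[:, -1:] - inclusive  # sums over k > i (right context)

    h_fwd = sigmoid(c_fwd[:, None] + left_sum)   # forward hidden states
    h_bwd = sigmoid(c_bwd[:, None] + right_sum)  # backward hidden states

    ll_fwd = sum(log_softmax(b_fwd + U @ h_fwd[:, i])[w] for i, w in enumerate(doc))
    ll_bwd = sum(log_softmax(b_bwd + U @ h_bwd[:, i])[w] for i, w in enumerate(doc))
    # Assumption: the two directional log-likelihoods are averaged.
    return 0.5 * (ll_fwd + ll_bwd), h_fwd, h_bwd

# Toy usage with random parameters (illustrative sizes only).
rng = np.random.default_rng(0)
V, H = 8, 4
W = rng.normal(scale=0.1, size=(H, V))
U = rng.normal(scale=0.1, size=(V, H))
c_fwd, c_bwd = np.zeros(H), np.zeros(H)
b_fwd, b_bwd = np.zeros(V), np.zeros(V)
doc = np.array([3, 1, 4, 1, 5])

ll, h_fwd, h_bwd = idocnade_log_likelihood(doc, W, U, c_fwd, c_bwd, b_fwd, b_bwd)
print(ll)
```

In this sketch the first forward state and the last backward state see an empty context, so they depend only on the biases. Training would maximize the returned log-likelihood by backpropagation, and the columns of `W` would serve as the word representations.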

Experimental results

Research questions

RQ1: Can incorporating both left and right context in a neural topic model lead to better document representation learning than unidirectional models?
RQ2: Does full-context modeling improve topic coherence and interpretability in generated topics?
RQ3: To what extent does the bidirectional context modeling in iDocNADE improve performance in document retrieval and text classification compared to DocNADE?
RQ4: How does the model generalize to out-of-domain and in-domain transfer learning settings?

Key findings

iDocNADE achieves a 9.6% relative improvement in precision at retrieval fraction 0.02 (0.57 vs. 0.52) compared to DocNADE across six datasets.
The model shows a 7.2% relative gain in F1 score (0.582 vs. 0.543) for text categorization, demonstrating superior applicability in downstream tasks.
iDocNADE achieves lower perplexity than DocNADE on both in-domain (20NewsGroups) and out-of-domain (SiROBs) test sets, indicating better generalization.
Qualitative analysis confirms that topics learned by iDocNADE are more interpretable, with clear semantic clusters such as 'religion' and 'trading' in 20NewsGroups and Reuters21578.
Word representation space learned by iDocNADE shows higher cosine similarity between semantically related words (e.g., 'god' and 'christ') than word2vec, indicating meaningful semantic structure.
Transfer learning experiments show that iDocNADE generalizes better than DocNADE, with lower perplexity on both in-domain and out-of-domain test sets.
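The reported percentages are relative gains over the DocNADE baseline, not absolute differences in score. A short check of the arithmetic, using only the averages quoted in the abstract (the helper name `relative_gain` is hypothetical):

```python
def relative_gain(new, baseline):
    # Relative improvement of `new` over `baseline`, as a fraction.
    return (new - baseline) / baseline

# Precision at retrieval fraction 0.02: iDocNADE 0.57 vs. DocNADE 0.52
print(f"{relative_gain(0.57, 0.52):.1%}")    # prints 9.6%

# F1 for text categorization: iDocNADE 0.582 vs. DocNADE 0.543
print(f"{relative_gain(0.582, 0.543):.1%}")  # prints 7.2%
```

The corresponding absolute differences are 0.05 in precision and 0.039 in F1.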

This review was created by AI and reviewed by human editors.