QUICK REVIEW

[Paper Review] Learning Topic Models - Going beyond SVD

Sanjeev Arora, Rong Ge, Ankur Moitra | arXiv (Cornell University) | Apr 9, 2012
Topic Modeling | 21 references | 58 citations
TL;DR

This paper proposes a polynomial-time algorithm for learning topic models using Nonnegative Matrix Factorization (NMF) instead of Singular Value Decomposition (SVD), overcoming SVD's limitations of requiring pure documents or only recovering topic spans. The key contribution is a provably correct algorithm under the separability assumption, which generalizes to correlated topic models like CTM and PAM.

ABSTRACT

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model. We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD - just as NMF has come to replace SVD in many applications.
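
The generative model described in the abstract can be written compactly. The notation below ($M$, $A$, $W$, $n$, $r$, $m$, $i_k$) is assumed here for illustration and does not come from this review; it is a sketch of the setup, not the paper's exact statement.

```latex
% Column k of A: topic k as a distribution over n words.
% Column j of W: document j as a distribution over r topics.
% Column j of M: expected word-frequency vector of document j.
\[
  M = A W, \qquad
  A \in \mathbb{R}_{\ge 0}^{n \times r}, \quad
  W \in \mathbb{R}_{\ge 0}^{r \times m}, \qquad
  \sum_{i=1}^{n} A_{ik} = 1, \quad
  \sum_{k=1}^{r} W_{kj} = 1 .
\]
% Separability: every topic k has an anchor word i_k that appears in no other topic.
\[
  \forall k \in \{1, \dots, r\} \;\; \exists\, i_k : \quad
  A_{i_k k} > 0 \quad \text{and} \quad A_{i_k k'} = 0 \;\; \text{for all } k' \neq k .
\]
```

Under this reading, each column of $M$ is a convex combination of the columns of $A$, which is the "convex combination of topic vectors" the abstract refers to.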

Motivation & Objective

  • Address the limitations of SVD-based methods in topic modeling, which either require pure documents (one topic per document) or only recover the span of topic vectors.
  • Develop a provable, polynomial-time algorithm for learning topic models that recovers the actual topic vectors, not just their span.
  • Justify Nonnegative Matrix Factorization (NMF) as a superior alternative to SVD in topic modeling by leveraging the nonnegativity of word-topic and document-topic matrices.
  • Generalize the algorithm to handle topic-topic correlations, such as in the Correlated Topic Model (CTM) and Pachinko Allocation Model (PAM).
  • Demonstrate that even under the separability assumption, maximum likelihood estimation (MLE) for topic models remains NP-hard, highlighting the need for efficient approximation algorithms.
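
A short worked equation may help show why recovering only the span is a real limitation. The notation is assumed for illustration: $M = AW$ is the expected word-by-document matrix, $A$ is the $n \times r$ topic matrix, $W$ is the $r \times m$ matrix of per-document topic weights, and $T$ is any invertible $r \times r$ matrix.

```latex
\[
  M = A W = (A T)\,(T^{-1} W) \qquad \text{for every invertible } T \in \mathbb{R}^{r \times r} .
\]
% A and AT have the same column span, so a rank-r factorization of M
% (such as a truncated SVD) determines the span but not A itself.
% Requiring AT >= 0 and T^{-1} W >= 0 entrywise rules out most choices of T,
% which is the extra information a nonnegative factorization exploits.
```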

Proposed method

  • Use Nonnegative Matrix Factorization (NMF) to decompose the document-word matrix into nonnegative factors representing topic vectors and document-topic distributions.
  • Leverage the separability assumption, under which each topic has at least one anchor word (a word with nonzero probability in that topic only), to enable efficient and provable recovery of topic vectors.
  • Apply a greedy algorithm that identifies the anchor words and then uses them to recover the topic vectors, with polynomial running time.
  • Generalize the framework to models with topic correlations by extending the NMF-based recovery to handle structured priors on document-topic distributions.
  • Prove that the algorithm recovers the true topic matrix and document-topic parameters under mild assumptions, with error bounds depending on sampling and noise levels.
  • Use a reduction from the minimum bisection problem to show that MLE for topic models is NP-hard even under separability, establishing theoretical hardness boundaries.
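
The anchor-word step in the list above can be illustrated with a small sketch. It makes several assumptions that are not stated in this review: the input is a noise-free word-word co-occurrence matrix `Q = A R Aᵀ`, where `A` is a hypothetical word-topic matrix and `R` a hypothetical topic-topic covariance matrix, and anchors are found by successive projection (repeatedly take the row of largest norm, then project it out). Successive projection is a stand-in chosen for brevity and is not necessarily the procedure analyzed in the paper. It works here because, after normalizing each row of `Q` to sum to 1, every row is a convex combination of the anchor rows, so the anchor rows are the extreme points.

```python
"""Anchor-word search on a noise-free, separable toy topic model (illustrative only)."""


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def find_anchors(Q, r):
    """Return r row indices of Q selected by successive projection."""
    # Normalize rows to sum to 1 so every row lies in the convex hull of the anchor rows.
    pts = [[x / sum(row) for x in row] for row in Q]
    anchors = []
    for _ in range(r):
        # The squared norm is convex, so its maximum over the rows is attained at an extreme point.
        sq_norms = [sum(x * x for x in p) for p in pts]
        i = max(range(len(pts)), key=lambda k: sq_norms[k])
        anchors.append(i)
        # Project every row onto the orthogonal complement of the chosen row.
        u, uu = pts[i], sq_norms[i]
        projected = []
        for p in pts:
            coef = sum(a * b for a, b in zip(p, u)) / uu
            projected.append([a - coef * b for a, b in zip(p, u)])
        pts = projected
    return anchors


# Hypothetical word-topic matrix: 5 words (rows), 2 topics (columns), columns sum to 1.
# Word 0 appears only in topic 0 and word 1 only in topic 1, so both are anchor words.
A = [
    [0.4, 0.0],
    [0.0, 0.5],
    [0.3, 0.2],
    [0.2, 0.1],
    [0.1, 0.2],
]
# Hypothetical topic-topic covariance matrix (symmetric, full rank, entries sum to 1).
R = [[0.3, 0.1], [0.1, 0.5]]

A_T = [list(col) for col in zip(*A)]
Q = matmul(matmul(A, R), A_T)  # noise-free word-word co-occurrence matrix

print(sorted(find_anchors(Q, 2)))  # prints [0, 1]
```

With sampled documents, `Q` is only an estimate, and any practical variant needs a tolerance for noise; this toy version does not address that.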

Experimental results

Research questions

  • RQ1: Can topic models be learned in polynomial time without requiring pure documents or only recovering the span of topics?
  • RQ2: Is Nonnegative Matrix Factorization (NMF) a viable and provably correct alternative to SVD for topic modeling under realistic assumptions?
  • RQ3: Does the separability assumption, where each topic has at least one unique word, enable efficient and accurate recovery of topic vectors?
  • RQ4: Can the proposed NMF-based algorithm be extended to more complex topic models that incorporate topic-topic correlations, such as CTM and PAM?
  • RQ5: Is maximum likelihood estimation (MLE) for topic models NP-hard even when the topic matrix is separable?

Key findings

  • The proposed NMF-based algorithm runs in polynomial time and recovers the true topic vectors under the separability assumption, unlike SVD-based methods that only recover the topic span.
  • The algorithm generalizes to correlated topic models such as CTM and PAM, enabling efficient learning in more realistic modeling settings.
  • The paper proves that even under the separability assumption, maximum likelihood estimation (MLE) for topic models is NP-hard, via a reduction from the minimum bisection problem.
  • The objective function of the MLE problem is shown to be maximized by canonical solutions corresponding to minimum bisections, with a gap of at least log 2 between optimal and suboptimal solutions.
  • Within the hardness reduction, any deviation from a canonical solution (e.g., non-uniform topic weights) is shown to cause a significant drop in objective value, so an optimal MLE solution must be canonical.
  • Theoretical analysis confirms that the algorithm’s performance is stable under sampling noise, with error bounds derived from concentration inequalities and Taylor expansions on concave functions.

This review was created by AI and reviewed by human editors.