[Paper Review] Improved Variational Autoencoders for Text Modeling using Dilated Convolutions
This paper shows that using a dilated CNN decoder in a VAE for text can outperform standard LSTM language models when decoder contextual capacity is carefully controlled, and demonstrates benefits for semi-supervised classification and unsupervised clustering.
Recent work on generative modeling of text has found that variational auto-encoders (VAE) incorporating LSTM decoders perform worse than simpler LSTM language models (Bowman et al., 2015). This negative result is so far poorly understood, but has been attributed to the propensity of LSTM decoders to ignore conditioning information from the encoder. In this paper, we experiment with a new type of decoder for VAE: a dilated CNN. By changing the decoder's dilation architecture, we control the effective context from previously generated words. In experiments, we find that there is a trade-off between the contextual capacity of the decoder and the amount of encoding information used. We show that with the right decoder, VAE can outperform LSTM language models. We demonstrate perplexity gains on two datasets, representing the first positive experimental result on the use of VAE for generative modeling of text. Further, we conduct an in-depth investigation of the use of VAE (with our new decoding architecture) for semi-supervised and unsupervised labeling tasks, demonstrating gains over several strong baselines.
Motivation & Objective
- Investigate why textual VAEs with LSTM decoders underperform plain LSTM language models, and identify the conditions under which VAEs can outperform them.
- Propose a dilated CNN decoder to flexibly control the contextual capacity available to the decoder.
- Demonstrate language modeling improvements on two datasets and explore semi-supervised and unsupervised text tasks using the proposed decoder.
Proposed method
- Introduce a dilated CNN decoder for VAE to replace the LSTM decoder in text modeling.
- Systematically vary decoder contextual capacity via dilation patterns and network depth to study reliance on latent variables.
- Use an LSTM encoder to produce q(z|x) and a Gaussian prior p(z); concatenate z with decoder inputs.
- Train with variational lower bound and KL annealing to prevent posterior collapse.
- Explore encoder initialization by pretraining as an LSTM language model to boost VAE performance.
- Extend the framework to semi-supervised classification and unsupervised clustering, using Gumbel-Softmax for discrete labels.
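To make "contextual capacity" concrete, the sketch below computes how many previous positions a stack of dilated causal 1-D convolutions can see. The kernel size of 3 and the two dilation patterns are illustrative assumptions, not configurations confirmed by this review, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_size, dilations):
    """Input positions visible to one output position of a stack of
    dilated causal 1-D convolutions (one convolution per layer).

    Each layer with dilation d widens the window by (kernel_size - 1) * d.
    """
    if kernel_size < 1:
        raise ValueError("kernel_size must be at least 1")
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical dilation patterns, chosen only for illustration.
shallow = receptive_field(3, [1, 2, 4])        # 1 + 2 * (1 + 2 + 4)
deep = receptive_field(3, [1, 2, 4, 8, 16])    # 1 + 2 * (1 + 2 + 4 + 8 + 16)
print(shallow, deep)
```

Adding layers or increasing dilations widens the window, which is the knob the review describes for trading decoder context against reliance on z.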
Experimental results
Research questions
- RQ1: Can a dilated CNN decoder with controllable contextual capacity enable textual VAEs to outperform LSTM language models?
- RQ2: How does decoder capacity affect the model’s use of latent representations (KL term) and overall perplexity?
- RQ3: Are dilated CNN VAEs beneficial for semi-supervised text classification and unsupervised clustering compared to strong baselines?
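For reference, the KL term mentioned in RQ2 is the second term of the standard variational lower bound on the log-likelihood. The form below is the usual textbook statement; the parameter subscripts θ (decoder) and φ (encoder) are notation assumed here and do not appear in the review.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

A KL term near zero means the approximate posterior q(z|x) is almost identical to the prior p(z), so the decoder is effectively ignoring the latent variable.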
Key findings
- A dilated CNN decoder with appropriate contextual capacity enables VAEs to surpass LSTM language models on two datasets.
- Smaller effective contextual windows force the decoder to rely more on the latent variable, increasing KL and improving latent representations.
- Larger decoders reduce reliance on latent variables and diminish VAE gains, with very large decoders performing similarly to pure LM baselines.
- Initializing the VAE encoder with pretrained LSTM language model parameters yields further improvements in NLL and perplexity.
- In semi-supervised settings, certain dilated CNN VAEs (e.g., SCNN-VAE-Semi) achieve higher classification accuracy than baselines, especially with limited labeled data, and encoder initialization boosts performance.
- In unsupervised clustering on Yahoo data, SCNN-VAE with encoder initialization achieves notable gains over GMM-based baselines.
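The KL values referred to in the findings above can be computed in closed form when q(z|x) is a diagonal Gaussian and p(z) is a standard normal. The following is a minimal sketch under that assumption; `gaussian_kl` and the log-variance parameterization are illustrative choices, not details taken from the paper.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.

    Per dimension: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    if len(mu) != len(logvar):
        raise ValueError("mu and logvar must have the same length")
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar)
    )


# Posterior equal to the prior: the latent carries no information.
collapsed = gaussian_kl([0.0, 0.0], [0.0, 0.0])
# Posterior shifted away from the prior: the latent is being used.
informative = gaussian_kl([1.0, -1.0], [0.0, 0.0])
print(collapsed, informative)
```

A value of zero corresponds to a collapsed posterior, while larger values indicate that the encoder is passing more information to the decoder through z.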
This review was created by AI and reviewed by human editors.