QUICK REVIEW

[Paper Review] Deep Variational Canonical Correlation Analysis

Weiran Wang, Xinchen Yan | arXiv (Cornell University) | Oct 11, 2016

Face and Expression Recognition | 35 references | 99 citations

TL;DR

Introduces Deep Variational CCA (VCCA) and VCCA-private, probabilistic multi-view models using neural networks to learn shared latent representations and disentangle private view-specific information, with tractable variational training and sample generation.

ABSTRACT

We present deep variational canonical correlation analysis (VCCA), a deep multi-view learning model that extends the latent variable model interpretation of linear CCA to nonlinear observation models parameterized by deep neural networks. We derive variational lower bounds of the data likelihood by parameterizing the posterior probability of the latent variables from the view that is available at test time. We also propose a variant of VCCA called VCCA-private that can, in addition to the "common variables" underlying both views, extract the "private variables" within each view, and disentangles the shared and private information for multi-view data without hard supervision. Experimental results on real-world datasets show that our methods are competitive across domains.

Motivation & Objective

Extend the latent-variable interpretation of linear CCA to nonlinear, deep-observation models.
Derive variational lower bounds on the data likelihood, using an approximate posterior conditioned on the view that is available at test time.
Introduce VCCA-private to disentangle shared (common) and private information across views.
Provide scalable end-to-end training via stochastic gradient methods and reparameterization.
Demonstrate competitive performance across image-image, speech-articulation, and image-text benchmarks.

Proposed method

Model x and y as nonlinear observations pθ(x|z) and pθ(y|z), both generated from a shared latent variable z with a Gaussian prior p(z).
Approximate the posterior pθ(z|x) with qφ(z|x) and maximize a variational lower bound L(x,y;θ,φ) on the log-likelihood log pθ(x,y).
Use the reparameterization trick to sample z from qφ(z|x) for Monte Carlo estimates of the bound.
Relate the objective to multi-view autoencoders (MVAE): the terms log pθ(x|z) and log pθ(y|z) act as reconstruction losses for the two views, with noise injected into the latent code through the covariance Σ of qφ(z|x).
Define VCCA-private by introducing private variables hx and hy, with a factorized posterior qφ(z|x)qφ(hx|x)qφ(hy|y) and a corresponding lower bound.
Train end to end by stochastic gradient optimization with Adam.
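
Written out, the two bounds referenced in the list above take the usual evidence-lower-bound form. The following is a sketch reconstructed from the model description in this review rather than a verbatim copy of the paper's equations; in particular, the prior symbols p(h_x) and p(h_y) and the conditioning of the private-model likelihoods on (z, h_x) and (z, h_y) are assumptions of this sketch.

```latex
% VCCA: the approximate posterior is conditioned on x only.
\begin{align}
\log p_\theta(x, y) \;\ge\; \mathcal{L}(x, y; \theta, \phi)
  = -D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
    + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z) + \log p_\theta(y \mid z)\big]
\end{align}

% VCCA-private: one KL term per latent variable, factorized posterior.
\begin{align}
\mathcal{L}_{\mathrm{private}}(x, y; \theta, \phi)
  = &-D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
     -D_{\mathrm{KL}}\big(q_\phi(h_x \mid x) \,\|\, p(h_x)\big)
     -D_{\mathrm{KL}}\big(q_\phi(h_y \mid y) \,\|\, p(h_y)\big) \notag \\
    &+ \mathbb{E}_{q_\phi(z \mid x)\, q_\phi(h_x \mid x)\, q_\phi(h_y \mid y)}
       \big[\log p_\theta(x \mid z, h_x) + \log p_\theta(y \mid z, h_y)\big]
\end{align}
```

The expectations have no closed form for neural-network decoders, which is why the method estimates them with samples drawn through the reparameterization trick.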

Experimental results

Research questions

RQ1: Can a deep probabilistic model recover a shared latent representation for multiple views while allowing nonlinear view-generating processes?
RQ2: Does a variational objective enable tractable inference and sampling from the latent space for multi-view data?
RQ3: Does introducing private, view-specific latent variables improve disentanglement and reconstruction without supervision?
RQ4: How do VCCA and VCCA-private perform across image-image, speech-articulation, and image-text benchmarks compared to prior multi-view methods?
RQ5: Can the learned representations support downstream tasks with or without access to all views at test time?

Key findings

VCCA and VCCA-private achieve competitive or superior downstream performance across datasets (MNIST, XRMB, MIR-Flickr).
VCCA can be trained end-to-end with stochastic gradient methods using a variational bound and reparameterization.
VCCA-private disentangles shared and private information, improving reconstruction quality and class separation in latent space.
On MNIST, VCCA achieves a 3.0% error rate and VCCA-private achieves 2.4% in the reported setup.
On XRMB, VCCA achieves a 28.0% phone error rate (PER) and VCCA-private achieves 25.2% PER, indicating competitive phonetic recognition performance.
On MIR-Flickr, VCCA and VCCA-private achieve higher mAP than several baselines and enable effective unimodal retrieval and cross-modal analysis.
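
As a rough illustration of how the variational bound and the reparameterization trick combine into a trainable objective, the NumPy sketch below computes a single-sample Monte Carlo estimate of the VCCA bound. It is a minimal sketch, not the authors' implementation: the function names, the unit-variance Gaussian likelihoods, the diagonal-Gaussian posterior, and the toy linear encoder and decoders (`make_toy_model`) are all assumptions made for illustration, and no gradients are computed.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in eps,
    # so z is a deterministic function of the encoder outputs.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def gaussian_log_lik(target, mean):
    # log N(target; mean, I), summed over features (unit variance is assumed).
    d = target.shape[-1]
    return -0.5 * (np.sum((target - mean) ** 2, axis=-1) + d * np.log(2.0 * np.pi))


def vcca_bound(x, y, encode, decode_x, decode_y, rng):
    # Single-sample estimate of the bound; returns one value per example.
    mu, log_var = encode(x)            # q(z|x) depends on view x only
    z = reparameterize(mu, log_var, rng)
    recon = gaussian_log_lik(x, decode_x(z)) + gaussian_log_lik(y, decode_y(z))
    return recon - kl_to_standard_normal(mu, log_var)


def make_toy_model(rng, dim_x=5, dim_y=3, dim_z=2):
    # Hypothetical linear maps standing in for the deep networks,
    # used only so the sketch runs end to end.
    w_mu = 0.1 * rng.standard_normal((dim_x, dim_z))
    w_lv = 0.1 * rng.standard_normal((dim_x, dim_z))
    w_x = 0.1 * rng.standard_normal((dim_z, dim_x))
    w_y = 0.1 * rng.standard_normal((dim_z, dim_y))
    encode = lambda x: (x @ w_mu, x @ w_lv)
    return encode, (lambda z: z @ w_x), (lambda z: z @ w_y)


rng = np.random.default_rng(0)
encode, decode_x, decode_y = make_toy_model(rng)
x = rng.standard_normal((4, 5))  # batch of 4 examples from view 1
y = rng.standard_normal((4, 3))  # paired examples from view 2
bound = vcca_bound(x, y, encode, decode_x, decode_y, rng)
print(bound.shape)  # one bound estimate per example in the batch
```

In an actual training loop, the negative of this estimate, averaged over a minibatch, would be minimized with an automatic-differentiation framework and an optimizer such as Adam.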

This review was created by AI and reviewed by human editors.