QUICK REVIEW

[Paper Review] Stacked Capsule Autoencoders

Adam R. Kosiorek, Sara Sabour | arXiv (Cornell University) | Jun 17, 2019

Generative Adversarial Networks and Image Synthesis | 36 references | 36 citations

TL;DR

Stacked Capsule Autoencoders (SCAE) learn object parts and their poses without supervision and group those parts into object capsules, achieving state-of-the-art unsupervised classification on MNIST and SVHN.

ABSTRACT

Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%). The code is available at https://github.com/google-research/google-research/tree/master/stacked_capsule_autoencoders

Motivation & Objective

Motivate unsupervised learning of structured object representations that are robust to viewpoint changes.
Develop a two-stage architecture (Part Capsule Autoencoder and Object Capsule Autoencoder) to segment parts and assemble them into objects.
Leverage geometric relationships between parts and objects to improve unsupervised classification and interpretability.

Proposed method

Introduce the Constellation Autoencoder (CCAE), which models sets of 2D points as constellations transformed by similarity transforms.
Develop the Part Capsule Autoencoder (PCAE), which infers part poses and presences from images and reconstructs the image from affine-transformed templates.
Stack the PCAE with the Object Capsule Autoencoder (OCAE) to form the SCAE; object capsules predict part poses, and their predictions are mixed for reconstruction.
Model images as spatial Gaussian mixtures whose components come from transformed templates and part poses.
Incorporate sparsity and entropy-based losses to encourage diverse, specialized usage of capsules across examples.
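The similarity transform mentioned for the CCAE can be illustrated with a minimal sketch. This is not the authors' code: the helper names `similarity_transform` and `pairwise_distances`, the unit-square "constellation", and the chosen scale, angle, and shift are all assumptions made for illustration. The sketch shows why geometric relationships between points survive a viewpoint change: the transform moves every point, but ratios of distances between points stay fixed.

```python
import math

def similarity_transform(points, scale, theta, tx, ty):
    """Rotate by theta (radians), scale uniformly, then translate (x, y) points.

    Hypothetical helper for illustration; not taken from the SCAE code base.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [(scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
            for x, y in points]

def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# A canonical constellation: the corners of a unit square (assumed example data).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# An assumed viewpoint: scale by 2, rotate 90 degrees, shift by (1, 0).
observed = similarity_transform(square, scale=2.0, theta=math.pi / 2, tx=1.0, ty=0.0)

# Every pairwise distance is multiplied by the same factor (the scale),
# so the relative geometry of the constellation is unchanged.
ratios = [o / c for o, c in zip(pairwise_distances(observed),
                                pairwise_distances(square))]
```

Every entry of `ratios` equals the scale factor, up to floating-point error. A model that reasons about relative geometry therefore sees the same constellation from any such viewpoint.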

Experimental results

Research questions

RQ1: Can unsupervised training of part- and object-capsules discover meaningful object structure from images?
RQ2: Do object capsule presences provide informative signals for unsupervised class discovery?
RQ3: How do geometric transformations and part–viewer relationships enable viewpoint-invariant reasoning?
RQ4: What is the impact of sparsity and encoder choices on unsupervised classification and generalization?

Key findings

SCAE achieves state-of-the-art unsupervised classification on MNIST (98.7% with LIN-MATCH; 99.0% with LIN-PRED) and SVHN (55.33% with LIN-MATCH; 67.27% with LIN-PRED).
The object capsule presence vectors form tight clusters correlating with class labels, enabling unsupervised class discovery.
Ablation studies show the contributions of sparsity losses, noise injection, transformation type, part-encoder choice, and the Set Transformer for object-capsule encoding.
In a viewpoint-generalization task (affNIST), unsupervised classification accuracy reaches 92.2% in one setup.
A two-stage architecture (PCAE + OCAE) plus CCAE-based pretraining enables unsupervised segmentation and object discovery from images.
The approach underperforms on CIFAR-10 due to fixed templates and background modeling limitations, suggesting potential for deeper hierarchies or input-dependent templates.
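This review does not define LIN-MATCH. The sketch below assumes it means the common unsupervised-classification protocol: cluster the features, find the one-to-one assignment of clusters to class labels that gives the highest accuracy, and report that accuracy. The function name `matched_accuracy` and the toy data are hypothetical. The brute-force search over permutations is used only to keep the example dependency-free. For ten classes, a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would be the practical choice.

```python
from collections import Counter
from itertools import permutations

def matched_accuracy(cluster_ids, labels, num_classes):
    """Best accuracy over all one-to-one assignments of cluster ids to labels.

    Hypothetical helper; brute force is only feasible for a few classes.
    """
    # counts[(k, c)] = number of examples in cluster k whose true label is c.
    counts = Counter(zip(cluster_ids, labels))
    best_correct = 0
    for assignment in permutations(range(num_classes)):
        # assignment[k] is the class label given to cluster k.
        correct = sum(counts[(k, assignment[k])] for k in range(num_classes))
        best_correct = max(best_correct, correct)
    return best_correct / len(labels)

# Toy data (assumed): three clusters, eight examples, one example misclustered.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
labels      = [2, 2, 1, 0, 0, 1, 1, 1]
accuracy = matched_accuracy(cluster_ids, labels, num_classes=3)
```

Cluster indices produced by a clustering algorithm are arbitrary, so a score of this kind has to be computed after matching clusters to labels rather than by comparing the raw indices.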


This review was created by AI and reviewed by human editors.