QUICK REVIEW

[Paper Review] Unsupervised learning through one-shot image-based shape reconstruction.

Dinesh Jayaraman, Ruohan Gao|arXiv (Cornell University)|Sep 1, 2017

Human Pose and Action Recognition|17 references|8 citations

TL;DR

This paper proposes a self-supervised, class-agnostic method for embedding 3D shape information into a single-view image representation using an encoder-decoder CNN. The model is trained to reconstruct the complete viewgrid, including all unseen views, from a single input image. The learned features support mental rotation of objects unseen during training and outperform several existing unsupervised feature learning methods on object recognition.

ABSTRACT

We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation. The main idea is a self-supervised training objective that, given only a single 2D image, requires all unseen views of the object to be predictable from learned features. We implement this idea as an encoder-decoder convolutional neural network. The network maps an input image of an unknown category and unknown viewpoint to a latent space, from which a deconvolutional decoder can best lift the image to its complete viewgrid showing the object from all viewing angles. Our class-agnostic training procedure encourages the representation to capture fundamental shape primitives and semantic regularities in a data-driven manner, without manual semantic labels. Our results on two widely-used shape datasets show 1) our approach successfully learns to perform mental rotation even for objects unseen during training, and 2) the learned latent space is a powerful representation for object recognition, outperforming several existing unsupervised feature learning methods.

Motivation & Objective

To develop an unsupervised feature learning method that captures 3D shape information from single-view images without category-specific supervision.
To enable generalization to unseen object categories by learning fundamental shape primitives and semantic regularities.
To eliminate the need for manual annotations by using a self-supervised objective based on view reconstruction.
To evaluate whether the learned representation supports zero-shot generalization and downstream recognition tasks.

Proposed method

The method uses an encoder-decoder convolutional neural network to map a single 2D image to a latent space and reconstruct a complete viewgrid of the object from all angles.
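The self-supervised training objective requires all unseen views to be predictable from the encoded features. The network receives only a single input image; the remaining views of the same object serve as the training target, so no manual labels are needed.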
The encoder extracts hierarchical features from a single image, while the decoder generates a multi-view output representing the object from all viewpoints.
The model is trained end-to-end using a reconstruction loss that minimizes the difference between predicted and ground-truth viewgrid images.
The approach is class-agnostic, meaning it does not require category labels or prior knowledge of object identities.
The latent space is optimized to encode 3D shape information that supports mental rotation and generalization to unseen objects.
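
The reconstruction objective described above can be sketched in a few lines. This is a minimal, framework-free illustration, not the authors' implementation: the function name `viewgrid_mse`, the nested-list image layout, and the choice of mean squared error are all assumptions made for illustration, since the review only states that the loss minimizes the difference between predicted and ground-truth viewgrid images.

```python
from typing import List

Image = List[List[float]]   # H x W grayscale pixels (assumed layout)
Viewgrid = List[Image]      # one image per viewpoint (assumed layout)


def viewgrid_mse(predicted: Viewgrid, target: Viewgrid) -> float:
    """Mean squared pixel error, averaged over every view in the grid.

    Hypothetical helper: squared error is an assumed choice of
    reconstruction loss, not one confirmed by the review.
    """
    if len(predicted) != len(target):
        raise ValueError("viewgrids must contain the same number of views")
    total, count = 0.0, 0
    for pred_view, true_view in zip(predicted, target):
        if len(pred_view) != len(true_view):
            raise ValueError("views must have the same height")
        for pred_row, true_row in zip(pred_view, true_view):
            if len(pred_row) != len(true_row):
                raise ValueError("views must have the same width")
            for p, t in zip(pred_row, true_row):
                total += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("viewgrids must contain at least one pixel")
    return total / count


# Toy example: a grid of two views, each a 1x2 image.
target = [[[0.0, 1.0]], [[1.0, 0.0]]]
perfect = [[[0.0, 1.0]], [[1.0, 0.0]]]
blurry = [[[0.5, 0.5]], [[0.5, 0.5]]]

print(viewgrid_mse(perfect, target))  # 0.0
print(viewgrid_mse(blurry, target))   # 0.25
```

In a training loop of this kind, the encoder would see only one view from the grid while the loss is computed over all of them, so the supervisory signal comes from the other views of the same object rather than from labels.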

Experimental results

Research questions

RQ1: Can a model learn to reconstruct all missing views of an object from a single 2D image without any category labels or supervision?
RQ2: Does the learned representation capture disentangled shape primitives and semantic regularities in a data-driven way?
RQ3: Can the model generalize to objects it has never seen during training, performing mental rotation implicitly?
RQ4: How well does the learned representation perform on downstream recognition tasks compared to existing unsupervised methods?
RQ5: Is the latent space semantically meaningful and useful for zero-shot object recognition?

Key findings

The model successfully performs mental rotation on unseen objects, demonstrating generalization beyond the training distribution.
On two widely used shape datasets, the learned latent space outperforms several existing unsupervised feature learning methods on object recognition.
The model generalizes across object categories without requiring category-level annotations or fine-tuning.
The self-supervised objective encourages the representation to capture fundamental shape primitives and semantic regularities without manual semantic labels.
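
The review does not state which classifier was used for the downstream recognition comparison. One common protocol for evaluating unsupervised features is nearest-neighbor classification on frozen latent codes. The sketch below assumes that protocol; the name `knn_predict`, the labels, and the toy 2-D feature vectors standing in for encoder outputs are all hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Feature = Sequence[float]


def squared_distance(a: Feature, b: Feature) -> float:
    """Squared Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(
    train: Sequence[Tuple[Feature, str]],
    query: Feature,
    k: int = 1,
) -> str:
    """Return the majority label among the k training features nearest to query.

    Hypothetical evaluation helper: the features are assumed to be frozen
    latent codes produced by an already-trained encoder.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of training items")
    ranked = sorted(train, key=lambda item: squared_distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy latent codes (made-up values, for illustration only).
train = [
    ((0.0, 0.0), "chair"),
    ((0.1, 0.2), "chair"),
    ((1.0, 1.0), "table"),
    ((0.9, 1.1), "table"),
]

print(knn_predict(train, (0.2, 0.1), k=3))  # chair
print(knn_predict(train, (1.0, 0.9), k=1))  # table
```

Because the classifier has no trainable parameters, accuracy under this protocol reflects the quality of the latent space itself, which is why it is a frequent choice for comparing unsupervised feature learners.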

This review was created by AI and reviewed by human editors.