QUICK REVIEW

[Paper Review] Unsupervised learning through one-shot image-based shape reconstruction.

Dinesh Jayaraman, Ruohan Gao | arXiv (Cornell University) | Sep 1, 2017
Human Pose and Action Recognition | 17 references | 8 citations
TL;DR

This paper proposes a self-supervised, class-agnostic method for embedding 3D shape information into the representation of a single 2D image, using an encoder-decoder CNN. The model is trained to reconstruct every unseen view of an object from one input image. The resulting features support mental rotation of objects unseen during training and outperform several existing unsupervised methods on object recognition.

ABSTRACT

We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation. The main idea is a self-supervised training objective that, given only a single 2D image, requires all unseen views of the object to be predictable from learned features. We implement this idea as an encoder-decoder convolutional neural network. The network maps an input image of an unknown category and unknown viewpoint to a latent space, from which a deconvolutional decoder can best lift the image to its complete viewgrid showing the object from all viewing angles. Our class-agnostic training procedure encourages the representation to capture fundamental shape primitives and semantic regularities in a data-driven manner, without manual semantic labels. Our results on two widely-used shape datasets show 1) our approach successfully learns to perform mental rotation even for objects unseen during training, and 2) the learned latent space is a powerful representation for object recognition, outperforming several existing unsupervised feature learning methods.

Motivation & Objective

  • To develop an unsupervised feature learning method that captures 3D shape information from single-view images without category-specific supervision.
  • To enable generalization to unseen object categories by learning fundamental shape primitives and semantic regularities.
  • To eliminate the need for manual annotations by using a self-supervised objective based on view reconstruction.
  • To evaluate whether the learned representation supports zero-shot generalization and downstream recognition tasks.

Proposed method

  • The method uses an encoder-decoder convolutional neural network to map a single 2D image to a latent space and reconstruct a complete viewgrid of the object from all angles.
  • The self-supervised training objective requires the model to predict all unseen views from the encoded features. The network receives only a single image as input; the training signal comes from the object's other views, not from manual labels.
  • The encoder extracts hierarchical features from a single image, while the decoder generates a multi-view output representing the object from all viewpoints.
  • The model is trained end-to-end using a reconstruction loss that minimizes the difference between predicted and ground-truth viewgrid images.
  • The approach is class-agnostic, meaning it does not require category labels or prior knowledge of object identities.
  • The latent space is optimized to encode the object's 3D shape from a single viewpoint, which supports mental rotation and generalization.
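
The pipeline described in the list above can be illustrated with a toy sketch. This is not the authors' implementation: single linear layers stand in for the convolutional encoder and deconvolutional decoder, mean squared error is assumed as the reconstruction loss, and the image size, viewgrid layout, and latent dimension are hypothetical values chosen only to keep the example small.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
IMG_H, IMG_W = 8, 8        # hypothetical image resolution
N_ELEV, N_AZIM = 2, 4      # hypothetical viewgrid layout (elevations x azimuths)
LATENT_DIM = 16            # hypothetical latent code size

rng = np.random.default_rng(0)

def init_params():
    """Toy linear weights standing in for the CNN encoder and decoder."""
    d_in = IMG_H * IMG_W
    d_out = N_ELEV * N_AZIM * IMG_H * IMG_W
    return {
        "enc": rng.normal(0.0, 0.1, size=(d_in, LATENT_DIM)),
        "dec": rng.normal(0.0, 0.1, size=(LATENT_DIM, d_out)),
    }

def encode(params, image):
    """Map one image of shape (H, W) to a latent code of shape (LATENT_DIM,)."""
    return np.tanh(image.reshape(-1) @ params["enc"])

def decode(params, code):
    """Map a latent code to a full viewgrid of shape (N_ELEV, N_AZIM, H, W)."""
    return (code @ params["dec"]).reshape(N_ELEV, N_AZIM, IMG_H, IMG_W)

def viewgrid_loss(params, image, target_viewgrid):
    """Mean squared error over every view in the grid (assumed loss form)."""
    pred = decode(params, encode(params, image))
    return float(np.mean((pred - target_viewgrid) ** 2))

params = init_params()
target = rng.random((N_ELEV, N_AZIM, IMG_H, IMG_W))  # stand-in ground-truth viewgrid
image = target[0, 0]                                 # the single observed view
loss = viewgrid_loss(params, image, target)
```

The point of the sketch is the shape of the problem: one view goes in, the whole grid of views comes out, and the loss is computed over all views at once, so the latent code has to carry information about the parts of the object that the input view does not show.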

Experimental results

Research questions

  • RQ1: Can a model learn to reconstruct all missing views of an object from a single 2D image without any category labels or manual supervision?
  • RQ2: Does the learned representation capture disentangled shape primitives and semantic regularities in a data-driven way?
  • RQ3: Can the model generalize to objects it has never seen during training, performing mental rotation implicitly?
  • RQ4: How well does the learned representation perform on downstream recognition tasks compared to existing unsupervised methods?
  • RQ5: Is the latent space semantically meaningful and useful for zero-shot object recognition?

Key findings

  • The model successfully performs mental rotation on unseen objects, demonstrating generalization beyond the training distribution.
  • On two widely used shape datasets, the learned latent space outperforms several existing unsupervised feature learning methods on object recognition.
  • The model generalizes across object categories without requiring category-level annotations or fine-tuning.
  • The self-supervised objective encourages features that capture fundamental shape primitives and semantic regularities.
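
The recognition findings above rest on using the latent codes as features for a classifier. This summary does not say which classifier the authors used, so the sketch below assumes one common protocol for evaluating frozen unsupervised features: k-nearest-neighbor classification in the latent space. The function name `knn_predict` and the toy codes and labels are hypothetical.

```python
import numpy as np

def knn_predict(train_codes, train_labels, query_codes, k=1):
    """Label each query code by majority vote among its k nearest training codes.

    Labels are assumed to be non-negative integers.
    """
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = query_codes[:, None, :] - train_codes[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Indices of the k closest training codes for each query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_labels[nearest]  # shape (n_query, k)
    return np.array([np.bincount(v).argmax() for v in votes])

# Hypothetical frozen latent codes for two well-separated classes.
train_codes = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
train_labels = np.array([0, 0, 1, 1])
query_codes = np.array([[0.05, 0.1], [0.95, 0.9]])
query_labels = np.array([0, 1])

pred = knn_predict(train_codes, train_labels, query_codes, k=1)
accuracy = float(np.mean(pred == query_labels))
```

In a real evaluation the codes would come from the trained encoder applied to held-out images, and the encoder weights would stay fixed, so that accuracy reflects the quality of the representation and not of a fine-tuned classifier.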

This review was created by AI and reviewed by human editors.