QUICK REVIEW

[Paper Review] Learning to Reconstruct Shapes from Unseen Classes

Xiuming Zhang, Zhoutong Zhang | arXiv (Cornell University) | Dec 28, 2018

3D Shape Modeling and Analysis | 3 references | 80 citations

TL;DR

GenRe introduces a modular, geometry-aware pipeline that disentangles 2.5D depth, spherical map completion, and voxel refinement to reconstruct 3D shapes from single images, generalizing to unseen object categories.

ABSTRACT

From a single image, humans are able to perceive the full 3D shape of an object by exploiting learned shape priors from everyday life. Contemporary single-image 3D reconstruction algorithms aim to solve this task in a similar fashion, but often end up with priors that are highly biased by training classes. Here we present an algorithm, Generalizable Reconstruction (GenRe), designed to capture more generic, class-agnostic shape priors. We achieve this with an inference network and training procedure that combine 2.5D representations of visible surfaces (depth and silhouette), spherical shape representations of both visible and non-visible surfaces, and 3D voxel-based representations, in a principled manner that exploits the causal structure of how 3D shapes give rise to 2D images. Experiments demonstrate that GenRe performs well on single-view shape reconstruction, and generalizes to diverse novel objects from categories not seen during training.

Motivation & Objective

Reconstruct 3D shapes from a single image in a way that generalizes beyond the training classes.
Disentangle geometric projections from shape reconstruction to improve generalization.
Leverage 2.5D representations, spherical maps, and voxel space for accurate reconstruction.
Demonstrate state-of-the-art performance on seen and unseen classes and analyze component contributions.

Proposed method

Three cascaded learned modules connected by fixed geometric projections: a depth estimator (2D->2.5D), a spherical map inpainting network (S->S), and a voxel refinement network. Two projections without learned parameters sit between them: a spherical map projection (2.5D->S) and a voxel projection (S->3D).
Depth is predicted from a single RGB image to provide a view-centered 2.5D sketch, which is then projected to a partial spherical map.
An inpainting network completes the partial spherical map, enabling projection to a full 3D voxel representation.
A voxel refinement network fuses depth-projected and spherical-map-projected voxel estimates to produce the final 3D shape.
All projections are fixed geometric operations, so the learnable components do not have to learn the projections themselves and model only object geometry, which is intended to improve generalization.
Training is viewer-centered, with 3D supervision aligned to the input image pose, to better generalize to unseen categories.
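
To make the fixed 2.5D->S projection concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole camera, a unit sphere centered on the object, and a spherical map that stores, per direction, the distance from the sphere inward to the outermost visible surface point. The function names (backproject, partial_spherical_map) and the parameters focal, center_z, n_theta, and n_phi are hypothetical and chosen only for illustration.

```python
import math

def backproject(depth, focal, center_z):
    """Lift a depth map to 3D points (assumed pinhole camera).

    depth is a list of rows; None marks background pixels.
    Points are returned in a frame whose origin lies center_z
    along the optical axis (the assumed object center).
    """
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None:
                continue
            x = (u - cx) * z / focal
            y = (v - cy) * z / focal
            points.append((x, y, z - center_z))
    return points

def partial_spherical_map(points, n_theta, n_phi, radius=1.0):
    """Bin points by direction from the origin.

    Each cell holds the distance from the enclosing sphere to the
    outermost point seen in that direction, or None where nothing
    is visible. The None cells are what an inpainting step would fill.
    """
    sph = [[None] * n_phi for _ in range(n_theta)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0 or r > radius:
            continue  # skip the center itself and points outside the sphere
        theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle in [0, pi]
        phi = math.atan2(y, x) % (2.0 * math.pi)       # azimuth in [0, 2*pi)
        i = min(int(theta / math.pi * n_theta), n_theta - 1)
        j = min(int(phi / (2.0 * math.pi) * n_phi), n_phi - 1)
        d = radius - r
        if sph[i][j] is None or d < sph[i][j]:
            sph[i][j] = d  # keep the point closest to the sphere surface
    return sph
```

Calling partial_spherical_map(backproject(depth, focal, center_z), n_theta, n_phi) on a single view leaves most cells empty, which illustrates why the pipeline needs a completion step before projecting to voxels. Because neither function has trainable parameters, nothing about the projection can overfit to the training categories.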

Experimental results

Research questions

RQ1: Can disentangling geometric projections from learning improve generalization to unseen object classes in single-image 3D reconstruction?
RQ2: Do 2.5D sketches and spherical-map representations enable better generalization than direct 3D completion in voxel space?
RQ3: How does each module contribute to reconstruction accuracy on seen versus unseen categories?
RQ4: Is the approach robust when transferring from synthetic ShapeNet data to real images (Pix3D dataset)?

Key findings

GenRe achieves state-of-the-art reconstruction performance for both seen and unseen classes in ShapeNet-based experiments.
A two-step, factorized approach (depth->spherical map inpainting->voxel projection) outperforms one-step spherical-map baselines.
On real images (Pix3D), GenRe generally outperforms baselines across unseen classes, with some exceptions (beds).
Depth estimation learned from three training categories generalizes to novel categories without significant degradation.
Spherical-map inpainting enables effective completion of non-visible surfaces and generalizes well to new shapes.
Viewer-centered supervision supports generalization to unseen categories better than object-centered supervision in many cases.
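
The difference between the two supervision schemes can be sketched as follows. This is an illustrative assumption, not the paper's code: it treats an object-centered target as points in a canonical pose and obtains the viewer-centered target by rotating them with the camera's azimuth and elevation. The helper names (rotation_y, rotation_x, to_viewer_centered) and the rotation order are hypothetical.

```python
import math

def rotation_y(angle):
    """Rotation matrix about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rotation_x(angle):
    """Rotation matrix about the horizontal (x) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def apply(R, p):
    """Multiply a 3x3 matrix by a 3D point."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) for i in range(3))

def to_viewer_centered(points, azimuth, elevation):
    """Rotate canonical-pose points into the pose seen in the image.

    Assumed convention: azimuth about y first, then elevation about x.
    """
    Ry, Rx = rotation_y(azimuth), rotation_x(elevation)
    return [apply(Rx, apply(Ry, p)) for p in points]
```

With object-centered targets, a network must also infer each category's canonical orientation, which is undefined for a category it has never seen. With viewer-centered targets, the output pose is fixed by the input image alone.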


This review was created by AI and reviewed by human editors.