QUICK REVIEW

[Paper Review] Monocular Object Instance Segmentation and Depth Ordering with CNNs

Ziyu Zhang, Alexander G. Schwing | arXiv (Cornell University) | May 12, 2015

Advanced Vision and Imaging | 38 references | 34 citations

TL;DR

This paper proposes a CNN-MRF framework for monocular instance-level segmentation and depth ordering from a single RGB image. CNNs make multi-scale patch predictions, and a Markov Random Field merges them into one coherent segmentation and depth ordering. On the KITTI benchmark the method outperforms baselines on instance-level metrics and depth ordering accuracy, and post-processing adds a further gain of around 2% on instance-level metrics.

ABSTRACT

In this paper we tackle the problem of instance-level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance-level segmentations where the instance ID encodes the depth ordering within image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional neural nets applied at overlapping patches of different resolutions, as well as the output of a connected component algorithm. It aims to predict accurate instance-level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show good performance on both tasks.
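The idea that an instance ID can encode depth order within a patch can be illustrated with a small relabeling sketch. It is only an illustration: the function name `depth_ordered_ids`, the per-instance distance dictionary, and the convention that ID 1 means "closest" are assumptions made here, not details taken from the paper.

```python
def depth_ordered_ids(instance_patch, instance_depth):
    """Relabel a patch so that ID 1 is the closest visible instance, 2 the next, etc.

    instance_patch: list of rows holding global instance IDs, 0 = background.
    instance_depth: dict mapping global instance ID -> distance from the camera
                    (hypothetical input, e.g. derived from 3D annotations).
    """
    # Only instances that actually appear in this patch receive a rank.
    visible = {i for row in instance_patch for i in row if i != 0}
    # Closest instance first; this ordering direction is an assumption.
    ranked = sorted(visible, key=lambda i: instance_depth[i])
    rank_of = {inst: rank for rank, inst in enumerate(ranked, start=1)}
    rank_of[0] = 0  # background stays 0
    return [[rank_of[i] for i in row] for row in instance_patch]


patch = [
    [0, 7, 7, 0],
    [3, 3, 7, 0],
    [3, 3, 0, 9],
]
depth = {3: 8.5, 7: 21.0, 9: 40.2, 12: 5.0}  # instance 12 is not in this patch
relabeled = depth_ordered_ids(patch, depth)
```

Because ranks are assigned per patch, the same car can carry different IDs in different patches, which is one reason a later global step is needed to reconcile overlapping patch predictions.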

Motivation & Objective

To address the challenge of jointly predicting instance-level segmentation and depth ordering from a single monocular image.
To eliminate reliance on object detection as input by jointly reasoning about detection, segmentation, and depth ordering.
To leverage weak supervision from 3D bounding boxes and stereo data during training, while requiring only a single RGB image at test time.
To improve accuracy and coherence of instance segmentation and depth ordering through a structured MRF that combines CNN predictions across multiple scales.
To demonstrate effectiveness on the complex, occlusion-rich KITTI benchmark for autonomous driving.

Proposed method

The method uses a CNN to predict depth-ordered instance segmentation on densely sampled image patches at multiple resolutions.
Unary potentials in the MRF are derived from CNN outputs on overlapping patches, where each instance ID encodes that instance's depth order within the patch.
Pairwise potentials in the MRF enforce consistency between neighboring pixels and connected components, using CNN-based affinity measures.
A connected component algorithm processes CNN outputs per patch to generate initial instance proposals.
The final segmentation and depth ordering are obtained by solving an energy minimization problem over a Markov Random Field combining unary and pairwise terms.
Post-processing via MRF inference significantly improves performance, especially on recall and depth ordering metrics.
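The energy minimization described above can be sketched on a toy problem. The sketch assumes a generic energy of the form E(y) = sum_i unary_i(y_i) + w * sum_(i,j) [y_i != y_j], with a Potts-style pairwise term and iterated conditional modes (ICM) as the solver. Both are stand-ins chosen for brevity: this review does not specify the paper's actual potentials or inference algorithm, and the names `energy` and `icm` are hypothetical.

```python
def energy(labels, unary, edges, weight):
    """E(y) = sum_i unary[i][y_i] + weight * (number of edges with differing labels)."""
    e = sum(unary[i][y] for i, y in enumerate(labels))
    e += weight * sum(labels[i] != labels[j] for i, j in edges)
    return e


def icm(unary, edges, weight, sweeps=10):
    """Iterated conditional modes: greedy per-node descent on the energy."""
    n, k = len(unary), len(unary[0])
    # Start from the unary-only solution (each node takes its cheapest label).
    labels = [min(range(k), key=lambda l: unary[i][l]) for i in range(n)]
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            def local_cost(l):
                # Cost of giving node i label l while its neighbors stay fixed.
                return unary[i][l] + weight * sum(l != labels[j] for j in neighbors[i])
            best = min(range(k), key=local_cost)
            if local_cost(best) < local_cost(labels[i]):
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


# Five pixels in a chain, three labels (0 = background, 1 and 2 = instances).
unary = [
    [2.0, 0.1, 2.0],
    [2.0, 0.2, 2.0],
    [2.0, 0.9, 0.8],  # noisy pixel: weakly prefers label 2
    [2.0, 0.2, 2.0],
    [0.1, 2.0, 2.0],
]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
unary_only = [min(range(3), key=lambda l: u[l]) for u in unary]
smoothed = icm(unary, edges, weight=0.5)
```

In this toy chain the pairwise term overrides the weak unary preference of the noisy middle pixel, which is the kind of correction that structured inference provides over unary-only labeling.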

Experimental results

Research questions

RQ1: Can a CNN-MRF framework jointly predict accurate instance-level segmentation and depth ordering from a single monocular image without requiring object detection as input?
RQ2: How effective is multi-scale patch-based CNN prediction combined with MRF inference for improving instance segmentation and depth ordering accuracy?
RQ3: To what extent does the MRF-based post-processing improve performance compared to raw CNN predictions or unary-only inference?
RQ4: How well does the method generalize to complex scenes with heavy occlusion, shadows, and small objects, as in the KITTI benchmark?
RQ5: Can weakly supervised signals from 3D bounding boxes and stereo data be effectively leveraged to train a single-image instance segmentation and depth ordering model?

Key findings

The full MRF approach achieves 83.1% accuracy in correctly ordering randomly sampled foreground pixel pairs, significantly outperforming baselines.
The method improves instance-level metrics by around 2% after post-processing, with the strongest gains in recall and MUCov/MWCov metrics.
The pairwise MRF formulation outperforms unary-only inference after post-processing, indicating that structured inference is essential for performance.
The approach achieves strong performance on the KITTI benchmark, with high object precision and improved recall compared to the baseline.
The method successfully segments and orders up to five car instances in a single image patch, even in complex occlusion patterns.
Failure cases are primarily due to tiny cars missed by the CNN and merged instances from the connected component algorithm.
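The pairwise depth-ordering accuracy mentioned in the findings can be approximated with a short sketch. The exact sampling protocol is not given in this review, so several choices here are assumptions: pairs are drawn uniformly from ground-truth foreground pixels, smaller positive ranks mean "closer", and a pair counts as wrong if either pixel is predicted as background. The function name `ordering_accuracy` is hypothetical.

```python
import random


def ordering_accuracy(pred_rank, true_rank, num_pairs=1000, seed=0):
    """Fraction of sampled foreground pixel pairs whose relative order is predicted correctly.

    pred_rank, true_rank: 2D lists of equal shape; 0 = background,
    smaller positive value = closer to the camera (assumed convention).
    """
    fg = [(r, c) for r, row in enumerate(true_rank)
          for c, v in enumerate(row) if v != 0]
    if len(fg) < 2:
        raise ValueError("need at least two foreground pixels")
    rng = random.Random(seed)

    def sign(a, b):
        return (a > b) - (a < b)

    correct = 0
    for _ in range(num_pairs):
        (r1, c1), (r2, c2) = rng.sample(fg, 2)
        p1, p2 = pred_rank[r1][c1], pred_rank[r2][c2]
        if p1 == 0 or p2 == 0:
            continue  # assumption: a missed foreground pixel makes the pair wrong
        if sign(p1, p2) == sign(true_rank[r1][c1], true_rank[r2][c2]):
            correct += 1
    return correct / num_pairs


truth = [[1, 1, 2],
         [0, 2, 2]]
swapped = [[2, 2, 1],
           [0, 1, 1]]
perfect_score = ordering_accuracy(truth, truth)
swapped_score = ordering_accuracy(swapped, truth)
```

A prediction that reverses the two instances still gets same-instance pairs right, so its score lands between 0 and 1 rather than at 0.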

This review was created by AI and reviewed by human editors.