QUICK REVIEW

[Paper Review] Recurrent Neural Network for (Un-)supervised Learning of Monocular Video Visual Odometry and Depth

Rui Wang, Stephen M. Pizer | arXiv (Cornell University) | Apr 15, 2019

Advanced Vision and Imaging | 43 references | 41 citations

TL;DR

The paper introduces an RNN-based framework that jointly estimates depth and visual odometry from monocular video, enabling supervised or unsupervised training with multi-view reprojection and forward-backward flow-consistency losses, achieving state-of-the-art results on KITTI.

ABSTRACT

Deep learning-based, single-view depth estimation methods have recently shown highly promising results. However, such methods ignore one of the most important features for determining depth in the human vision system, which is motion. We propose a learning-based, multi-view dense depth map and odometry estimation method that uses Recurrent Neural Networks (RNN) and trains utilizing multi-view image reprojection and forward-backward flow-consistency losses. Our model can be trained in a supervised or even unsupervised mode. It is designed for depth and visual odometry estimation from video where the input frames are temporally correlated. However, it also generalizes to single-view depth estimation. Our method produces superior results to the state-of-the-art approaches for single-view and multi-view learning-based depth estimation on the KITTI driving dataset.

Motivation & Objective

Leverage temporal information in monocular video to improve depth and pose estimation.
Enable simultaneous depth and visual odometry estimation using ConvLSTM units.
Develop robust self-supervised training via multi-view reprojection and forward-backward flow constraints.
Maintain consistent scene scale across arbitrary-length sequences.
Demonstrate superior performance on KITTI compared to state-of-the-art methods.

Proposed method

The framework uses two networks. The depth network is an encoder-decoder with integrated ConvLSTM units that produces depth Z_t and hidden state h_t^d.
The visual odometry network uses a VGG16 backbone with ConvLSTM units and outputs the relative 6DoF pose P_t→t-1.
Training uses a differentiable geometric module to perform multi-view image warping from Z_t and P_t→t-1.
A multi-view reprojection loss (L_fw, L_bw) aligns the current view with previous views via differentiable warping.
Forward-backward flow-consistency loss enforces consistency between forward and backward optical flow.
Optional absolute depth loss L_depth (and alternative smoothness variants) to achieve absolute scale when ground truth is available.
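
The warping and consistency terms listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole camera with intrinsics K, that the pose P_t→t-1 is given as a rotation R and translation t mapping points from the frame of view t into the frame of view t-1, that the "flow" is the rigid flow induced by depth and camera motion, and that both losses are plain mean absolute errors. The function names (pixel_grid, rigid_flow, bilinear_sample, reprojection_loss, flow_consistency_loss) are hypothetical, and the review does not specify details such as masking or loss weights.

```python
import numpy as np

def pixel_grid(H, W):
    """Pixel coordinates u (column) and v (row), each of shape (H, W)."""
    return np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))

def rigid_flow(depth, K, R, t):
    """Flow from the current view to another view, induced by depth and camera motion.

    depth: (H, W) depth of the current view.
    K: (3, 3) pinhole intrinsics (assumed known).
    R, t: assumed to map 3D points from the current camera frame to the other one.
    Returns a (2, H, W) array of (du, dv) displacements in pixels.
    """
    H, W = depth.shape
    u, v = pixel_grid(H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    points = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)  # back-project to 3D
    points_other = R @ points + t.reshape(3, 1)               # apply relative pose
    proj = K @ points_other                                   # project into other view
    uv_other = proj[:2] / proj[2:3]
    return (uv_other - pix[:2]).reshape(2, H, W)

def bilinear_sample(img, x, y):
    """Sample img (C, H, W) at float coordinates x, y (each (H, W)), clamped to the border."""
    _, H, W = img.shape
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = img[:, y0, x0] * (1 - wx) + img[:, y0, x1] * wx
    bottom = img[:, y1, x0] * (1 - wx) + img[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def reprojection_loss(img_cur, img_other, flow):
    """Mean absolute photometric error after warping img_other into the current view."""
    _, H, W = img_cur.shape
    u, v = pixel_grid(H, W)
    warped = bilinear_sample(img_other, u + flow[0], v + flow[1])
    return np.abs(warped - img_cur).mean()

def flow_consistency_loss(flow_fw, flow_bw):
    """Forward flow plus the backward flow sampled at the forward target should cancel."""
    _, H, W = flow_fw.shape
    u, v = pixel_grid(H, W)
    bw_at_target = bilinear_sample(flow_bw, u + flow_fw[0], v + flow_fw[1])
    return np.abs(flow_fw + bw_at_target).mean()

# Toy example: constant depth and a pure sideways camera translation.
H, W = 6, 8
K = np.array([[100.0, 0.0, 4.0],
              [0.0, 100.0, 3.0],
              [0.0, 0.0, 1.0]])
depth = np.full((H, W), 10.0)
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

flow_fw = rigid_flow(depth, K, R, t)            # view t -> view t-1
flow_bw = rigid_flow(depth, K, R.T, -R.T @ t)   # inverse pose, view t-1 -> view t
fb_residual = flow_consistency_loss(flow_fw, flow_bw)

rng = np.random.default_rng(0)
img = rng.random((3, H, W))
photo_same = reprojection_loss(img, img, np.zeros((2, H, W)))
```

In the toy example the forward flow is fx * t_x / Z = 100 * 0.1 / 10 = 1 pixel along u, the backward flow is its negative, and the consistency residual vanishes. In a trainable setting the same steps would be written with an autodiff framework so that gradients reach the depth and pose networks.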

Experimental results

Research questions

RQ1: Can ConvLSTM-based architectures exploit temporal information to improve monocular depth estimation and ego-motion over multiple frames?
RQ2: Does incorporating multi-view reprojection and forward-backward flow-consistency improve unsupervised depth and pose estimation compared to pairwise reprojection alone?
RQ3: Can the proposed framework achieve consistent scene scale and operate on arbitrary-length sequences?
RQ4: How does the method perform under supervised versus unsupervised training on KITTI?
RQ5: What is the impact of recurrent unit placement and temporal window size on depth/pose accuracy?

Key findings

The method outperforms state-of-the-art approaches for both supervised and unsupervised depth estimation on KITTI.
Unsupervised training with multi-view reprojection and flow consistency outperforms several supervised baselines and other unsupervised methods.
Encoder-only ConvLSTM placement in the depth network yields better depth/pose performance than full or decoder placements.
Multi-view reprojection losses provide stronger supervision than consecutive reprojection, especially in unsupervised settings.
Depth estimation improves with larger temporal windows up to around 10 frames, then plateaus, while the model supports arbitrary-length sequences.
The framework produces depth at multiple scales and maintains a consistent scene scale across long sequences.

This review was created by AI and reviewed by human editors.