[Paper Review] Deep Ordinal Regression Network for Monocular Depth Estimation
This paper presents a Deep Ordinal Regression Network (DORN) for monocular depth estimation that uses spacing-increasing discretization (SID) and an ordinal regression loss, achieving state-of-the-art results on four benchmarks (KITTI, ScanNet, Make3D, and NYU Depth v2) with a lightweight, multi-scale architecture that avoids heavy spatial pooling.
Monocular depth estimation, which plays a crucial role in understanding 3D scene geometry, is an ill-posed problem. Recent methods have improved significantly by exploiting image-level information and hierarchical features from deep convolutional neural networks (DCNNs). These methods model depth estimation as a regression problem and train the regression networks by minimizing mean squared error, which suffers from slow convergence and unsatisfactory local solutions. In addition, existing depth estimation networks employ repeated spatial pooling operations, resulting in undesirably low-resolution feature maps. To obtain high-resolution depth maps, skip-connections or multi-layer deconvolution networks are required, which complicates network training and requires far more computation. To eliminate or at least largely reduce these problems, we introduce a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem. By training the network using an ordinal regression loss, our method achieves much higher accuracy and faster convergence at the same time. Furthermore, we adopt a multi-scale network structure which avoids unnecessary spatial pooling and captures multi-scale information in parallel. The method described in this paper achieves state-of-the-art results on four challenging benchmarks, i.e., KITTI [17], ScanNet [9], Make3D [50], and NYU Depth v2 [42], and wins the 1st prize in the Robust Vision Challenge 2018. Code has been made available at: https://github.com/hufu6371/DORN.
Motivation & Objective
- Address the ill-posed nature of monocular depth estimation from a single image.
- Improve training convergence and final accuracy over standard regression with MSE losses.
- Avoid aggressive spatial pooling by using a high-resolution, multi-scale architecture with dilated convolutions.
- Introduce a spacing-increasing discretization strategy and an ordinal regression loss to train depth networks end-to-end.
- Demonstrate state-of-the-art performance on four challenging depth benchmarks and provide practical guidelines for depth discretization and network design.
Proposed method
- Discretize continuous depth values into intervals using spacing-increasing discretization (SID) rather than uniform discretization (UD).
- Cast depth estimation as an ordinal regression problem and optimize with a tailored ordinal regression loss that accounts for label ordering.
- Adopt a dense feature extractor based on dilated convolutions that preserves resolution, removing the last downsampling layers to avoid loss of spatial detail.
- Incorporate a multi-scale scene understanding module (ASPP with multiple dilation rates, a cross-channel branch, and a lightweight full-image encoder) to capture global and multi-scale information.
- Train the network end-to-end without stage-wise training or iterative refinement.
- Decode the predicted ordinal label into a depth value by averaging the two thresholds that bound the predicted interval.
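The discretization and decoding steps listed above can be illustrated with a minimal NumPy sketch. It assumes a depth range `[alpha, beta]` split into `K` intervals whose edges are uniform in log-depth, and a network that emits `K - 1` probabilities `P(label > k)` per pixel. The function names, the `K - 1` output layout, and the 0.5 cutoff as written here are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sid_thresholds(alpha, beta, K):
    """SID: K+1 interval edges t_0..t_K, uniformly spaced in log-depth."""
    i = np.arange(K + 1)
    return np.exp(np.log(alpha) + np.log(beta / alpha) * i / K)

def ud_thresholds(alpha, beta, K):
    """UD: K+1 interval edges t_0..t_K, uniformly spaced in depth."""
    return alpha + (beta - alpha) * np.arange(K + 1) / K

def depth_to_label(depth, edges):
    """Index of the interval containing each depth, clipped to 0..K-1."""
    K = len(edges) - 1
    idx = np.searchsorted(edges, depth, side="right") - 1
    return np.clip(idx, 0, K - 1)

def decode(probs, edges):
    """Turn ordinal probabilities into depth.

    probs[..., k] is assumed to be P(label > k) for k = 0..K-2.
    The label is the number of thresholds passed with probability >= 0.5;
    the depth is the midpoint of that interval's two edges.
    """
    K = len(edges) - 1
    label = np.clip((probs >= 0.5).sum(axis=-1), 0, K - 1)
    return 0.5 * (edges[label] + edges[label + 1])

# Example with a hypothetical 1 m to 80 m range and only 4 intervals.
edges = sid_thresholds(1.0, 80.0, 4)
labels = depth_to_label(np.array([1.0, 5.0, 80.0]), edges)
depth = decode(np.array([0.9, 0.8, 0.2]), edges)
```

Because the SID edges grow geometrically, near depths get narrow intervals and far depths get wide ones, whereas `ud_thresholds` gives every interval the same width.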
Experimental results
Research questions
- RQ1: Does SID discretization with ordinal regression improve depth estimation accuracy and convergence compared to regression-based training?
- RQ2: What is the impact of a dilated convolution based architecture and avoidance of heavy pooling on depth map quality and computation?
- RQ3: How does the proposed full-image encoder contribute to performance relative to other global-context strategies?
- RQ4: How sensitive is performance to the number of depth intervals used in SID?
- RQ5: Do the gains generalize across outdoor and indoor benchmark datasets (KITTI, ScanNet, Make3D, NYU Depth v2)?
Key findings
- DORN achieves state-of-the-art results on KITTI, ScanNet, Make3D, and NYU Depth v2 benchmarks.
- SID outperforms uniform discretization in depth estimation performance.
- Ordinal regression loss with ordered depth intervals improves convergence and accuracy over standard regression losses.
- A compact full-image encoder significantly reduces parameters while providing competitive or better performance than fc-based full-image approaches.
- Removing the last pooling layers and using dilated convolutions yields high-resolution depth maps without heavy multi-scale fusion.
- The method performs well on both outdoor and indoor datasets and ranks favorably on online evaluation servers.
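To make concrete how an ordinal regression loss accounts for label ordering, as the findings above refer to, here is a minimal NumPy sketch. It treats each of the `K - 1` outputs per pixel as a binary question "is the true label greater than k?" and sums the binary cross-entropies. The function name, the array layout, and the `eps` clipping are assumptions made for illustration; the paper's actual implementation may differ in detail.

```python
import numpy as np

def ordinal_loss(probs, labels, eps=1e-8):
    """Mean ordinal regression loss over N pixels.

    probs:  (N, K-1) array, probs[n, k] assumed to be P(label_n > k).
    labels: (N,) integer array of ground-truth interval indices in 0..K-1.
    """
    k = np.arange(probs.shape[1])
    # Binary target for question k is 1 when the true label exceeds k.
    target = (k[None, :] < labels[:, None]).astype(probs.dtype)
    p = np.clip(probs, eps, 1.0 - eps)  # guard against log(0)
    per_pixel = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).sum(axis=1)
    return per_pixel.mean()

# One pixel whose outputs say "label is 2" (two thresholds passed).
probs = np.array([[0.9, 0.9, 0.1]])
loss_exact = ordinal_loss(probs, np.array([2]))
loss_off_by_one = ordinal_loss(probs, np.array([1]))
loss_off_by_two = ordinal_loss(probs, np.array([0]))
```

In this example the loss grows with the distance between the predicted and true labels, which is the property that a plain multi-class cross-entropy over unordered bins does not have.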
This review was created by AI and reviewed by human editors.