QUICK REVIEW

[Paper Review] A Large RGB-D Dataset for Semi-supervised Monocular Depth Estimation

Jae Hoon Cho, Dongbo Min|arXiv (Cornell University)|Apr 23, 2019

Advanced Vision and Imaging|References: 65|Cited by: 25

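One-Sentence Summary

This paper proposes a semi-supervised monocular depth estimation method built on a student-teacher framework: a deep stereo matching network (the teacher) generates high-quality pseudo depth maps from an outdoor stereo dataset of one million images; these maps are refined with ensemble predictions and stereo confidence maps and then used to train a lightweight monocular depth network (the student). The method achieves state-of-the-art performance and yields semantically meaningful features that transfer to downstream tasks such as semantic segmentation and road detection.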

ABSTRACT

Current self-supervised methods for monocular depth estimation are largely based on deeply nested convolutional networks that leverage stereo image pairs or monocular sequences during a training phase. However, they often exhibit inaccurate results around occluded regions and depth boundaries. In this paper, we present a simple yet effective approach for monocular depth estimation using stereo image pairs. The study aims to propose a student-teacher strategy in which a shallow student network is trained with the auxiliary information obtained from a deeper and more accurate teacher network. Specifically, we first train the stereo teacher network by fully utilizing the binocular perception of 3-D geometry and then use the depth predictions of the teacher network to train the student network for monocular depth inference. This enables us to exploit all available depth data from massive unlabeled stereo pairs. We propose a strategy that involves the use of a data ensemble to merge the multiple depth predictions of the teacher network to improve the training samples by collecting non-trivial knowledge beyond a single prediction. To refine the inaccurate depth estimation that is used when training the student network, we further propose stereo confidence-guided regression loss that handles the unreliable pseudo depth values in occlusion, texture-less region, and repetitive pattern. To complement the existing dataset comprising outdoor driving scenes, we built a novel large-scale dataset consisting of one million outdoor stereo images taken using hand-held stereo cameras. Finally, we demonstrate that the monocular depth estimation network provides feature representations that are suitable for high-level vision tasks. The experimental results for various outdoor scenarios demonstrate the effectiveness and flexibility of our approach, which outperforms state-of-the-art approaches.

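Research Motivation and Objectives

Address the scarcity of dense, high-quality depth supervision for monocular depth estimation by exploiting large-scale stereo image pairs.
Improve depth accuracy in occluded and textureless regions, where self-supervised methods typically fail.
Develop a semi-supervised training strategy that reduces reliance on expensive ground-truth depth maps.
Build a large-scale, diverse outdoor stereo dataset to support robust depth estimation.
Show that monocular depth prediction can serve as a strong proxy task for high-level vision applications such as semantic segmentation and road detection.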

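Proposed Method

Train a deep stereo matching network on a small amount of ground-truth depth data to serve as the teacher network.
Use the teacher network to generate pseudo ground-truth depth maps from the large pool of unlabeled stereo image pairs in the DIML/CVL dataset.
Fuse the teacher network's multi-scale predictions to produce more accurate and robust pseudo depth maps.
Generate stereo confidence maps that identify unreliable regions (such as occluded and textureless areas) and guide the training loss.
Introduce a stereo confidence-guided regression loss that down-weights supervision in low-confidence regions while training the student network.
Train a lightweight monocular depth estimation student network with the pseudo depth maps and the confidence-guided loss so that it generalizes well across diverse outdoor scenes.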
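The fusion and confidence-guided loss steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the per-pixel mean used for fusion, the L1 error, and the `threshold` parameter are all assumptions made for illustration; the summary only states that multiple teacher predictions are merged and that low-confidence regions receive less supervision.

```python
import numpy as np


def fuse_predictions(depth_maps):
    """Fuse several teacher predictions into one pseudo depth map.

    Assumption: the maps were already resized to a common resolution,
    and a per-pixel mean is used as the fusion rule.
    """
    stack = np.stack(depth_maps, axis=0)  # shape (K, H, W)
    return stack.mean(axis=0)


def confidence_guided_l1(pred, pseudo, confidence, threshold=0.3):
    """Confidence-weighted L1 regression loss (hypothetical form).

    Pixels whose confidence falls below `threshold` are ignored;
    the remaining pixels are weighted by their confidence value.
    """
    weights = np.where(confidence >= threshold, confidence, 0.0)
    total = weights.sum()
    if total == 0:
        return 0.0  # no reliable pixel, so no supervision signal
    return float((weights * np.abs(pred - pseudo)).sum() / total)


# Toy 2x2 example: two teacher predictions, one student prediction.
teacher_a = np.array([[1.0, 2.0], [3.0, 8.0]])
teacher_b = np.array([[1.0, 2.0], [5.0, 8.0]])
pseudo_depth = fuse_predictions([teacher_a, teacher_b])

student_pred = np.array([[1.0, 2.0], [3.0, 4.0]])
confidence = np.array([[1.0, 1.0], [0.5, 0.1]])  # bottom-right pixel is unreliable

loss = confidence_guided_l1(student_pred, pseudo_depth, confidence)
print(loss)
```

In this toy case the large error at the bottom-right pixel does not contribute to the loss because its confidence is below the threshold, which is the behavior the confidence-guided loss is meant to provide in occluded or textureless regions.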

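Experimental Results

Research Questions

RQ1: Can the student-teacher framework effectively transfer stereo matching knowledge to monocular depth estimation without relying on dense ground-truth depth maps?
RQ2: What role do ensemble predictions and confidence maps play in improving the quality of pseudo depth supervision in challenging regions?
RQ3: Compared with standard self-supervised methods, how much does the proposed method reduce artifacts in occluded and textureless regions?
RQ4: Can monocular depth estimation trained with this method serve as a strong proxy task for high-level vision tasks such as semantic segmentation and road detection?
RQ5: How does the proposed method compare with state-of-the-art methods on benchmark datasets?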

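Key Findings

The proposed method outperforms state-of-the-art self-supervised monocular depth estimation methods on outdoor benchmarks, with higher depth accuracy and sharper boundaries.
A model pretrained with this method reaches 65.47% mean intersection over union (mIoU) on the Cityscapes semantic segmentation benchmark, comparable to ImageNet pretraining.
On the KITTI road detection benchmark, the method reaches an Fmax of 95.65% and an AP of 94.46%, outperforming both a model trained from scratch and an ImageNet-pretrained model.
Ensemble predictions and stereo confidence maps substantially improve pseudo depth map quality, especially in occluded and textureless regions.
The monocular depth network trained with this framework produces semantically meaningful features that transfer well to downstream tasks.
The method achieves state-of-the-art performance using only a small amount of ground-truth depth supervision together with a large-scale stereo dataset, which significantly reduces reliance on expensive LiDAR data.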
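For readers unfamiliar with the mIoU figure reported above, the following is a minimal sketch of how mean intersection over union is commonly computed from predicted and ground-truth label maps. The function name `mean_iou` and the toy labels are assumptions for illustration; this is not the official Cityscapes evaluation code, which additionally handles ignored labels.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes that appear in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # confusion[i, j] = number of pixels with true class i predicted as class j
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both prediction and target
    return float((intersection[valid] / union[valid]).mean())


# Toy example with two classes and four pixels.
target_labels = [0, 0, 1, 1]
pred_labels = [0, 1, 1, 1]
miou = mean_iou(pred_labels, target_labels, num_classes=2)
print(miou)
```

Here class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12.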

This review was generated by AI and reviewed by human editors.