QUICK REVIEW

[Paper Review] Unsupervised Monocular Depth Estimation with Left-Right Consistency

Clément Godard, Oisin Mac Aodha|arXiv (Cornell University)|Sep 13, 2016

Advanced Vision and Imaging | References: 56 | Cited by: 32

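One-Sentence Summary

This paper proposes an unsupervised monocular depth estimation method that is trained on binocular stereo footage rather than ground truth depth data. By introducing a novel loss that enforces left-right disparity consistency during training, the model achieves state-of-the-art performance on the KITTI dataset, even outperforming some supervised methods trained with ground truth depth.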

ABSTRACT

Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.

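Research Motivation and Objectives

Address the scarcity and high cost of the ground truth depth data used to train monocular depth models.
Enable end-to-end unsupervised training from stereo image pairs alone, removing the dependence on explicit depth supervision.
Improve depth estimation quality by enforcing consistency between the disparities predicted from the left and right views.
Demonstrate the model's generalization across diverse datasets, including a newly collected urban stereo dataset.
Achieve competitive performance on the KITTI and Make3D benchmarks without any ground truth depth supervision.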

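Proposed Method

The method takes stereo image pairs as training input and trains a fully convolutional neural network to predict a disparity map from a single image.
It uses an image reconstruction loss based on differentiable sampling: the right image is warped with the predicted disparity to reconstruct the left image.
It introduces a novel left-right disparity consistency loss that forces the disparities predicted for the left and right images to agree with each other.
The network is trained end to end with a combined loss: the image reconstruction loss plus the left-right disparity consistency loss.
Post-processing includes median filtering and edge-aware smoothing to refine the depth predictions.
The model is fine-tuned on new datasets using stereo data only, which allows it to generalize to unseen environments.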
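
The reconstruction and left-right consistency terms described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: images are single-channel and rectified, disparities are in pixels, a left pixel at column x corresponds to the right pixel at column x - d, and a plain L1 photometric error stands in for whatever appearance term the network is actually trained with. The function names, along with the `focal_px` and `baseline_m` parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_rows(src, x):
    """Linearly interpolate each row of src (H, W) at fractional columns x (H, W)."""
    h, w = src.shape
    x = np.clip(x, 0.0, w - 1.0)           # clamp samples to the image border
    x0 = np.floor(x).astype(int)           # left neighbour column
    x1 = np.minimum(x0 + 1, w - 1)         # right neighbour column
    frac = x - x0                          # interpolation weight
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * src[rows, x0] + frac * src[rows, x1]

def reconstruct_left(right, disp_left):
    """Rebuild the left view by sampling the right view at x - d_left (assumed sign)."""
    cols = np.arange(right.shape[1], dtype=float)[None, :]
    return sample_rows(right, cols - disp_left)

def reconstruction_loss(left, right, disp_left):
    """Mean L1 photometric error between the left image and its reconstruction."""
    return np.mean(np.abs(left - reconstruct_left(right, disp_left)))

def lr_consistency_loss(disp_left, disp_right):
    """Mean L1 gap between d_left and d_right sampled at the matching right pixel."""
    cols = np.arange(disp_left.shape[1], dtype=float)[None, :]
    disp_right_at_left = sample_rows(disp_right, cols - disp_left)
    return np.mean(np.abs(disp_left - disp_right_at_left))

def disparity_to_depth(disp, focal_px, baseline_m):
    """Standard rectified-stereo relation: depth = focal length * baseline / disparity."""
    return focal_px * baseline_m / disp

# Toy scene: the right view is the left view shifted by a constant 3 pixels.
rng = np.random.default_rng(0)
left = rng.random((4, 16))
right = np.roll(left, -3, axis=1)
disp = np.full((4, 16), 3.0)
print(reconstruction_loss(left, right, disp))   # small; only the clamped border columns differ
print(lr_consistency_loss(disp, disp))          # consistent disparities give zero loss
```

In a real training setup these terms would be computed with a differentiable sampler inside a deep learning framework, so that gradients flow back to the predicted disparities; NumPy is used here only to make the arithmetic explicit.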

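Experimental Results

Research Questions

RQ1: Can a monocular depth estimation model be trained effectively without any ground truth depth supervision?
RQ2: In the unsupervised setting, how does enforcing left-right disparity consistency improve depth estimation quality?
RQ3: Can a model trained only on stereo data generalize to new, unseen datasets without fine-tuning?
RQ4: Does the proposed method outperform supervised baselines that use ground truth depth annotations?
RQ5: Is the method robust to challenges such as specular highlights, transparency, and occlusion?

Key Findings

The model achieves state-of-the-art performance on the KITTI 2015 driving dataset, outperforming several supervised methods that use ground truth depth data.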
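On the KITTI dataset, the method reports a squared relative error (Sq Rel) of 15.517, an absolute relative error (Abs Rel) of 0.893, an RMSE of 11.542, and a log10 error of 0.223.
On the Make3D dataset, the method reports a Sq Rel of 11.990, an Abs Rel of 0.535, an RMSE of 11.513, and a log10 error of 0.156, outperforming the unsupervised baselines and qualitatively matching or exceeding some supervised methods.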
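The model generalizes well to the CamVid dataset and to the newly collected urban stereo dataset, producing visually plausible depth maps without retraining.
Starting from a model pre-trained on Cityscapes, fine-tuning on the new urban dataset yields visually convincing depth predictions on a test set captured with the same camera.
Compared with training on the reconstruction loss alone, the left-right disparity consistency loss markedly improves performance, especially at occlusion boundaries and on thin structures such as poles and signs.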
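
For readers unfamiliar with the error measures quoted above, the sketch below computes them using their commonly used definitions in the depth estimation literature. It assumes that pixels with no ground truth are encoded as zero and that predictions are strictly positive; the function name `depth_metrics` is hypothetical, and any evaluation crop or depth cap applied in the paper is not reproduced here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Abs Rel, Sq Rel, RMSE and log10 error over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0                          # assumption: 0 marks missing ground truth
    pred, gt = pred[valid], gt[valid]
    diff = pred - gt
    return {
        "abs_rel": np.mean(np.abs(diff) / gt),     # mean |d - d*| / d*
        "sq_rel": np.mean(diff ** 2 / gt),         # mean (d - d*)^2 / d*
        "rmse": np.sqrt(np.mean(diff ** 2)),       # root mean squared error
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
    }

# Toy example: every prediction is twice the true depth; the last pixel has no ground truth.
metrics = depth_metrics([2.0, 20.0, 5.0], [1.0, 10.0, 0.0])
print(metrics)
```

Note that Sq Rel is a relative measure (squared error divided by the true depth), not a mean squared error, which is why it is reported separately from RMSE. Lower is better for all four.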

This review was generated by AI and reviewed by human editors.