QUICK REVIEW

[Paper Review] Self-Supervised Monocular Depth Estimation with Internal Feature Fusion

Hang Zhou, David Greenwood|arXiv (Cornell University)|Oct 18, 2021

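Advanced Vision and Imaging | References: 40 | Cited by: 58

One-Sentence Summary

DIFFNet combines a high-resolution HRNet encoder, internal multi-stage feature fusion, and an attention-based decoder to improve self-supervised monocular depth estimation. It reaches state-of-the-art results on KITTI, particularly at higher input resolutions.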

ABSTRACT

Self-supervised learning for depth estimation uses geometry in image sequences for supervision and shows promising results. Like many computer vision tasks, depth network performance is determined by the capability to learn accurate spatial and semantic representations from images. Therefore, it is natural to exploit semantic segmentation networks for depth estimation. In this work, based on a well-developed semantic segmentation network HRNet, we propose a novel depth estimation network DIFFNet, which can make use of semantic information in down and upsampling procedures. By applying feature fusion and an attention mechanism, our proposed method outperforms the state-of-the-art monocular depth estimation methods on the KITTI benchmark. Our method also demonstrates greater potential on higher resolution training data. We propose an additional extended evaluation strategy by establishing a test set of challenging cases, empirically derived from the standard benchmark.

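Motivation and Objectives

Address single-image depth estimation in a self-supervised framework, where structure-from-motion (SfM) geometry in image sequences provides the supervision.
Explore how semantically rich, high-resolution features can be fused inside the encoder to bridge the gap between semantic and spatial information.
Propose DIFFNet, which uses internal multi-stage feature fusion and an attention-based decoder to improve depth accuracy.
Demonstrate state-of-the-art results on KITTI and run an extended evaluation on challenging scenes.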

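Proposed Method

Use HRNet as the depth encoder to keep features that are both high-resolution and semantically rich.
Implement internal feature fusion by concatenating multi-stage features across HRNet streams, producing semantically diverse, high-resolution representations.
Build a decoder with attention modules that process the skip connections in a U-Net-style architecture.
Evaluate three attention strategies (channel-wise, spatial, and channel-spatial) and select channel-wise attention as the best.
Train in a self-supervised framework using photometric and SSIM-based losses together with a standard depth-smoothness regularizer.
Run ablation experiments to isolate the effects of pretraining, multi-stage fusion, and attention on depth accuracy.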
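
The training objective named in the method list can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the common Monodepth2-style formulation (an SSIM term blended with an L1 term using `alpha = 0.85`, plus an edge-aware smoothness term), works on grayscale images, and takes an already warped source view as input. The function names, the 3x3 window, and the constants are assumptions made for this example.

```python
import numpy as np

def box_mean3(x):
    """3x3 mean filter with reflect padding; x has shape (H, W)."""
    p = np.pad(x, 1, mode="reflect")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM between two grayscale images scaled to [0, 1]."""
    mu_a, mu_b = box_mean3(a), box_mean3(b)
    var_a = box_mean3(a * a) - mu_a ** 2
    var_b = box_mean3(b * b) - mu_b ** 2
    cov = box_mean3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def photometric_error(target, warped, alpha=0.85):
    """Blend of SSIM dissimilarity and L1 error (alpha is an assumed weight)."""
    l1 = np.abs(target - warped)
    dssim = np.clip((1.0 - ssim_map(target, warped)) / 2.0, 0.0, 1.0)
    return alpha * dssim + (1.0 - alpha) * l1

def smoothness(disp, image):
    """Edge-aware smoothness: disparity gradients are down-weighted at image edges."""
    d = disp / (disp.mean() + 1e-7)  # mean-normalise the disparity
    dx = np.abs(d[:, 1:] - d[:, :-1])
    dy = np.abs(d[1:, :] - d[:-1, :])
    wx = np.exp(-np.abs(image[:, 1:] - image[:, :-1]))
    wy = np.exp(-np.abs(image[1:, :] - image[:-1, :]))
    return (dx * wx).mean() + (dy * wy).mean()
```

In a full pipeline the `warped` image would come from reprojecting a neighbouring frame with the predicted depth and camera pose; that step is omitted here.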

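Experimental Results

Research Questions

RQ1: How does internal fusion of multi-stage, high-resolution features within a semantic backbone improve monocular depth estimation under self-supervision?
RQ2: What effect do different attention mechanisms have when applied to the decoder's skip connections for depth prediction?
RQ3: Does DIFFNet outperform existing self-supervised methods on KITTI, particularly at higher input resolutions?
RQ4: Does an extended evaluation on challenging KITTI cases reveal a robustness advantage from semantic information in depth estimation?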

Key Findings

Method	Train	W×H	Abs Rel	Sq Rel	RMSE	RMSE log	delta1	delta2	delta3
SfMlearner	M	640x192	0.183	1.595	6.709	0.270	0.734	0.902	0.959
Li	M	416x128	0.130	0.950	5.138	0.209	0.843	0.948	0.978
Chen	M+Se	512x256	0.118	0.905	5.096	0.211	0.839	0.945	0.977
Monodepth2	M	640x192	0.115	0.903	4.863	0.193	0.877	0.959	0.981
SGDepth	M+Se	640x192	0.113	0.835	4.693	0.191	0.879	0.961	0.981
SAFENet	M+Se	640x192	0.112	0.788	4.582	0.187	0.878	0.963	0.983
VC-Depth	M	640x192	0.112	0.816	4.715	0.190	0.880	0.960	0.982
PackNet	M	640x192	0.111	0.785	4.601	0.189	0.878	0.960	0.982
Mono-Uncertainty	M	640x192	0.111	0.863	4.756	0.188	0.881	0.961	0.982
Fang	M	640x192	0.111	-	4.660	0.186	0.884	0.962	0.982
HR-depth	M	640x192	0.109	0.792	4.632	0.185	0.887	0.962	0.983
DIFFNet	M	640x192	0.102	0.764	4.483	0.180	0.896	0.965	0.983
Monodepth2	MS	640x192	0.106	0.818	4.750	0.196	0.874	0.957	0.979
HR-depth	MS	640x192	0.107	0.785	4.612	0.185	0.887	0.962	0.982
Fang	MS	640x192	0.101	-	4.512	0.188	0.881	0.961	0.981
DIFFNet	MS	640x192	0.101	0.749	4.445	0.179	0.898	0.965	0.983
Monodepth2	MS	1024x320	0.115	0.882	4.701	0.190	0.879	0.961	0.982
Fang	MS	1024x320	0.109	-	4.581	0.185	0.890	0.964	0.983
PackNet	MS	1280x384	0.107	0.802	4.538	0.186	0.889	0.962	0.981
SGDepth	MS	1280x384	0.107	0.768	4.468	0.186	0.891	0.963	0.982
SAFENet	MS	1024x320	0.106	0.743	4.489	0.181	0.884	0.965	0.984
HR-depth	MS	1024x320	0.106	0.755	4.472	0.181	0.892	0.966	0.984
Feat-Depth	MS	1024x320	0.104	0.729	4.481	0.179	0.893	0.965	0.984
Guizilini	MS	1280x384	0.100	0.761	4.270	0.175	0.902	0.965	0.982
DIFFNet	MS	1024x320	0.097	0.722	4.345	0.174	0.907	0.967	0.984
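
The metric columns in the table follow the standard depth-evaluation definitions: the four error columns are lower-is-better, and delta1 to delta3 are the fractions of pixels whose ratio to ground truth is below 1.25, 1.25^2, and 1.25^3. A small sketch of how they are typically computed is given below. The function name is hypothetical, and the usual preprocessing (valid-pixel masking, depth capping, median scaling for monocular methods) is assumed to have been applied beforehand.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth metrics for matched arrays of positive depths."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }

# Toy example with three pixels (depths in metres).
metrics = depth_metrics([10.0, 20.0, 40.0], [10.0, 22.0, 30.0])
```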

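DIFFNet achieves state-of-the-art or competitive results on KITTI, outperforming prior self-supervised methods on the standard metrics.
ImageNet pretraining of the encoder gives the largest performance gain among the ablated components.
Channel-wise attention in the decoder is more accurate than spatial or channel-spatial attention.
Multi-stage feature fusion consistently improves depth prediction across the attention configurations.
At higher resolution (1024x320), DIFFNet further improves accuracy and keeps its lead over the compared methods.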
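
To make the fusion and channel-wise attention findings concrete, here is a rough NumPy sketch of concatenating feature maps and re-weighting the result per channel. It assumes a squeeze-and-excitation style gate (global average pool, two linear layers, sigmoid); this review does not specify the exact module, so the design, the names `channel_attention` and `fuse_and_attend`, and the weight shapes are illustrative assumptions only.

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Re-weight channels of feat (C, H, W) with a squeeze-and-excitation style gate.

    w1 has shape (C // r, C) and w2 has shape (C, C // r) for a reduction ratio r.
    """
    squeeze = feat.mean(axis=(1, 2))              # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)        # bottleneck layer + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid -> per-channel weight
    return feat * gate[:, None, None]

def fuse_and_attend(feats, w1, w2):
    """Concatenate same-resolution feature maps along channels, then gate them."""
    fused = np.concatenate(feats, axis=0)
    return channel_attention(fused, w1, w2)

# Toy usage with random, untrained weights (shapes are arbitrary).
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 6, 8))
b = rng.standard_normal((4, 6, 8))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))
out = fuse_and_attend([a, b], w1, w2)  # shape (8, 6, 8)
```

Because the gate is computed once per channel, the module can emphasise or suppress whole feature maps from different stages without changing their spatial layout.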

This review was generated by AI and checked by a human editor.