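[Paper Explained] Attention-based Context Aggregation Network for Monocular Depth Estimation
This paper proposes an attention-based context aggregation network (ACAN) for monocular depth estimation. The method uses self-attention to adaptively model long-range pixel-level context and image-level context, reducing the grid artifacts caused by fixed dilation rates. By introducing soft ordinal inference to minimize discretization error, it achieves state-of-the-art performance on the NYU Depth V2 and KITTI benchmarks, with an RMSE of 3.599 on KITTI when using ResNet-101.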
Depth estimation is a classic computer vision task that plays a crucial role in understanding 3D scene geometry. Recently, methods based on deep convolutional neural networks have achieved promising results in monocular depth estimation. In particular, frameworks that combine multi-scale features extracted by a dilated-convolution block (atrous spatial pyramid pooling, ASPP) have brought significant improvements in dense labeling tasks. However, the discretized, predefined dilation rates cannot capture the continuous context information that varies across scenes, and they easily introduce grid artifacts in depth estimation. In this paper, we propose an attention-based context aggregation network (ACAN) to tackle these difficulties. Based on the self-attention model, ACAN adaptively learns task-specific similarities between pixels to model context information. First, we recast monocular depth estimation as a dense-labeling multi-class classification problem and propose a soft ordinal inference that transforms the predicted probabilities into continuous depth values, which reduces the discretization error (about a 1% decrease in RMSE). Second, the proposed ACAN aggregates both image-level and pixel-level context information for depth estimation, where the former expresses the statistical characteristics of the whole image and the latter extracts the long-range spatial dependencies for each pixel. Third, to further reduce the inconsistency between the RGB image and the depth map, we construct an attention loss that minimizes their information entropy. We evaluate on public monocular depth estimation benchmarks (NYU Depth V2 and KITTI). The experiments demonstrate the superiority of the proposed ACAN, which achieves results competitive with the state of the art.
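Motivation and Objectives
- To address the limitations of atrous spatial pyramid pooling (ASPP) with fixed dilation rates in monocular depth estimation: it introduces grid artifacts and cannot capture continuous scene context.
- To improve depth estimation by using self-attention to model long-range pixel-level dependencies together with image-level statistical context.
- To reduce the discretization error in depth prediction by recasting the task as a soft ordinal classification problem.
- To strengthen the alignment between the RGB image and the predicted depth map through an attention-based entropy-minimization loss.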
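The grid-artifact concern with fixed dilation rates can be illustrated with a small sketch. It is a 1-D simplification, not the paper's code: the helper `touched_offsets` is a hypothetical name, and a 3-tap kernel is assumed. It lists which input positions can influence one output position for a given sequence of dilation rates.

```python
def touched_offsets(dilation_rates, kernel_size=3):
    """Input offsets that reach one output position after stacking
    1-D convolutions with the given dilation rates (assumed odd kernel size)."""
    half = kernel_size // 2
    offsets = {0}
    for rate in dilation_rates:
        # A dilated kernel samples positions spaced `rate` apart.
        taps = [k * rate for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


dense = touched_offsets([1, 1])    # two ordinary 3-tap convolutions
gridded = touched_offsets([2, 2])  # two 3-tap convolutions with dilation 2
branch = touched_offsets([6])      # one ASPP-like branch with dilation 6

print(dense)    # [-2, -1, 0, 1, 2]
print(gridded)  # [-4, -2, 0, 2, 4]
print(branch)   # [-6, 0, 6]
```

With dilation, the sampled positions form a sparse grid and the pixels in between are never read, which is one intuition for why predefined rates can leave grid patterns and miss context that falls between the taps.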
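Proposed Method
- Recasts monocular depth estimation as a dense multi-class classification problem to support learning ordinal probabilities.
- Introduces soft ordinal inference, which converts the predicted probabilities into continuous depth values and lowers RMSE by about 1%.
- Adds a self-attention module in the decoder that learns task-specific pixel-level similarities and captures long-range spatial dependencies.
- Adds an image-level pooling module that extracts global statistical context, complementing the pixel-level attention.
- Uses a residual encoder (ResNet) with dilated convolutions to preserve spatial resolution and avoid excessive downsampling.
- Proposes an attention loss that improves feature alignment by minimizing the information entropy between the RGB features and the predicted depth map.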
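The two kinds of context named in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ACAN implementation: similarities are plain dot products with no learned projections, image-level context is global average pooling, and the fusion by concatenation is a choice made here for clarity. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pixel_level_context(features):
    """For each pixel, average all pixel features, weighted by the
    softmax of its dot-product similarity to every pixel."""
    dim = len(features[0])
    context = []
    for query in features:
        weights = softmax([dot(query, key) for key in features])
        context.append(
            [sum(w * f[c] for w, f in zip(weights, features)) for c in range(dim)]
        )
    return context


def image_level_context(features):
    """Global average pooling: one vector summarizing the whole image."""
    n, dim = len(features), len(features[0])
    return [sum(f[c] for f in features) / n for c in range(dim)]


def aggregate(features):
    """Concatenate each pixel's feature, its pixel-level context,
    and the shared image-level context (assumed fusion scheme)."""
    pixel_ctx = pixel_level_context(features)
    image_ctx = image_level_context(features)
    return [f + p + image_ctx for f, p in zip(features, pixel_ctx)]


# Four "pixels" (a flattened H*W grid) with 2-channel features.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
out = aggregate(features)
print(len(out), len(out[0]))  # 4 6
```

Because the weights depend on feature similarity rather than on a fixed spatial offset, each pixel can draw on any other pixel in the image, which is the property the fixed-rate dilated branches lack.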
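Experimental Results
Research Questions
- RQ1: Can self-attention effectively model continuous, scene-dependent context in monocular depth estimation and outperform fixed-dilation-rate approaches such as ASPP?
- RQ2: How does soft ordinal inference compare with standard regression or hard classification in reducing the discretization error of depth predictions?
- RQ3: To what extent does combining pixel-level and image-level context improve depth estimation accuracy?
- RQ4: Can the attention-based loss, by minimizing the entropy between RGB and depth features, improve feature consistency and prediction quality?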
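Key Findings
- With ResNet-101, ACAN achieves an RMSE of 3.599 on the KITTI dataset, outperforming all of the state-of-the-art methods it is compared against.
- Soft ordinal inference reduces the discretization error, lowering RMSE by about 1% and yielding more continuous depth.
- Qualitative results show that ACAN produces sharper boundaries and more detailed depth maps, whereas ASPP-style methods exhibit grid artifacts.
- The attention-based loss markedly improves the alignment between RGB and depth features, reducing noise and inconsistency in the predictions.
- On NYU Depth V2, ACAN performs better, showing stronger generalization and detail preservation, especially in complex scenes.
- Ablation studies confirm that both pixel-level and image-level context aggregation contribute significantly to the final performance gain.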
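To make the discretization-error finding concrete, the sketch below compares hard (argmax) decoding with a soft, probability-weighted decoding and scores both with RMSE. It is a toy illustration only: the uniform bin spacing, the plain expectation over bin centers, and all numbers are assumptions chosen for this example, and the paper's exact discretization and inference formula are not reproduced here.

```python
import math


def bin_centers(d_min, d_max, num_bins):
    """Centers of uniformly spaced depth bins (assumed spacing)."""
    width = (d_max - d_min) / num_bins
    return [d_min + (k + 0.5) * width for k in range(num_bins)]


def hard_decode(probs, centers):
    """Pick the center of the most probable bin."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return centers[best]


def soft_decode(probs, centers):
    """Probability-weighted average of bin centers (continuous output)."""
    return sum(p * c for p, c in zip(probs, centers))


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


centers = bin_centers(0.0, 10.0, 5)  # bin centers at 1, 3, 5, 7, 9

# Made-up per-pixel class probabilities and ground-truth depths.
probs = [
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.8, 0.1],
]
target = [3.8, 7.0]

hard = [hard_decode(p, centers) for p in probs]
soft = [soft_decode(p, centers) for p in probs]

print("hard RMSE:", rmse(hard, target))
print("soft RMSE:", rmse(soft, target))
```

Hard decoding can only return one of the bin centers, so a true depth lying between two centers always incurs an error; the soft variant can land between centers when the probability mass is split across neighboring bins.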
This explainer was generated by AI and reviewed by human editors.