QUICK REVIEW

[Paper Review] AMNet: Deep Atrous Multiscale Stereo Disparity Estimation Networks

Xianzhi Du, Mostafa El‐Khamy | arXiv (Cornell University) | Apr 19, 2019

Advanced Vision and Imaging | References: 40 | Cited by: 49

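One-Sentence Summary

AMNet is an atrous multiscale network with a depthwise-separable ResNet backbone and an extended cost volume that achieves state-of-the-art stereo disparity estimation on KITTI, SceneFlow, and Middlebury; it also extends to a foreground-background aware variant (FBA-AMNet) trained with multitask learning.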

ABSTRACT

In this paper, a new deep learning architecture for stereo disparity estimation is proposed. The proposed atrous multiscale network (AMNet) adopts an efficient feature extractor with depthwise-separable convolutions and an extended cost volume that deploys novel stereo matching costs on the deep features. A stacked atrous multiscale network is proposed to aggregate rich multiscale contextual information from the cost volume which allows for estimating the disparity with high accuracy at multiple scales. AMNet can be further modified to be a foreground-background aware network, FBA-AMNet, which is capable of discriminating between the foreground and the background objects in the scene at multiple scales. An iterative multitask learning method is proposed to train FBA-AMNet end-to-end. The proposed disparity estimation networks, AMNet and FBA-AMNet, show accurate disparity estimates and advance the state of the art on the challenging Middlebury, KITTI 2012, KITTI 2015, and Sceneflow stereo disparity estimation benchmarks.

Research Motivation and Objectives

Develop a deep learning architecture for accurate stereo disparity estimation.
Enhance contextual information capture via atrous multiscale modules to improve multiscale disparity estimation.
Improve disparity accuracy with an extended cost volume combining multiple matching costs.
Explore foreground-background awareness as an auxiliary task to boost disparity quality.
Demonstrate state-of-the-art performance on the KITTI, SceneFlow, and Middlebury benchmarks.

Proposed Method

Use a depthwise separable ResNet (D-ResNet) as an efficient feature extractor with increased learning capacity.
Introduce an Atrous Multiscale (AM) module to aggregate multiscale contextual information without losing resolution.
Construct an Extended Cost Volume (ECV) that combines disparity-level feature concatenation, disparity-level feature distance, and disparity-level depthwise correlation.
Process the cost volume with a stacked AM (SAM) to progressively refine context aggregation.
Regress disparity with soft argmin from the outputs of the AM modules; in FBA-AMNet, train with a multitask loss that adds foreground-background segmentation.
Optionally train FBA-AMNet with an iterative multitask scheme in which foreground-background segmentation informs disparity estimation.
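
To make the cost-volume and soft-argmin steps above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it works on a single scanline of scalar features, uses only the feature-distance cost (one of the three costs the extended cost volume combines), and assumes the commonly used soft-argmin form, i.e. the expected disparity under a softmax over negated costs. The function names `distance_cost_volume` and `soft_argmin` and the toy scanlines are hypothetical.

```python
import math


def distance_cost_volume(left, right, max_disp):
    """Build cost[d][x] = |left[x] - right[x - d]| for one scanline.

    Positions where x - d falls outside the right scanline get an
    infinite cost, so they receive zero weight in soft_argmin.
    """
    width = len(left)
    volume = []
    for d in range(max_disp):
        row = []
        for x in range(width):
            if x - d >= 0:
                row.append(abs(left[x] - right[x - d]))
            else:
                row.append(math.inf)
        volume.append(row)
    return volume


def soft_argmin(costs):
    """Return the expected disparity under softmax(-cost).

    Subtracting the maximum score before exponentiating keeps the
    computation numerically stable.
    """
    scores = [-c for c in costs]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


# Toy example: the left scanline is the right one shifted by 2 pixels.
left = [0, 0, 5, 9, 5, 0, 0, 0]
right = [5, 9, 5, 0, 0, 0, 0, 0]
volume = distance_cost_volume(left, right, max_disp=4)
costs_at_peak = [volume[d][3] for d in range(4)]
print(soft_argmin(costs_at_peak))  # close to 2, the true shift
```

Because soft argmin is a weighted average rather than a hard minimum, it yields sub-pixel estimates and stays differentiable, which is why it suits end-to-end training.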

Experimental Results

Research Questions

RQ1: Can atrous multiscale context aggregation improve stereo disparity estimation over conventional encoder-decoder architectures?
RQ2: Does an extended cost volume with multiple matching metrics enhance disparity accuracy?
RQ3: Does foreground-background awareness via multitask learning further improve disparity estimates, particularly at object boundaries?
RQ4: What are the performance gains on standard benchmarks (KITTI 2015/2012, SceneFlow, Middlebury) when using AMNet and FBA-AMNet?

Key Findings

AMNet and FBA-AMNet achieve state-of-the-art disparity accuracy on the KITTI 2015, KITTI 2012, and SceneFlow benchmarks.
AMNet-32 and FBA-AMNet-32 outperform prior methods by significant margins on the KITTI 2015 D1-all metric (e.g., FBA-AMNet-32 reaches 1.84% D1-all over all pixels).
AMNet-32 attains an end-point error (EPE) of 0.74 on SceneFlow, a 32.1% improvement over the previous best.
Among the evaluated variants, FBA-AMNet-32 achieves the lowest disparity error on the KITTI 2015 test set.
Foreground-background awareness via multitask learning improves disparity estimation without requiring separate semantic segmentation during inference.
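
For readers unfamiliar with the two metrics quoted above, the following sketch shows how they are typically computed. It assumes the usual conventions rather than anything stated in this review: EPE is the mean absolute disparity error in pixels, and D1 counts a pixel as an outlier when its error exceeds both 3 px and 5% of the ground-truth disparity (the commonly cited KITTI 2015 rule). The function names and the sample values are hypothetical, and the handling of invalid pixels is omitted.

```python
def end_point_error(pred, gt):
    """Mean absolute disparity error in pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def d1_error(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of pixels whose error exceeds both thresholds.

    A pixel is an outlier only if its error is larger than abs_thresh
    pixels AND larger than rel_thresh times the ground-truth disparity.
    """
    outliers = 0
    for p, g in zip(pred, gt):
        err = abs(p - g)
        if err > abs_thresh and err > rel_thresh * g:
            outliers += 1
    return 100.0 * outliers / len(gt)


# Made-up disparities for four pixels.
gt = [10.0, 20.0, 40.0, 100.0]
pred = [10.5, 24.0, 41.0, 104.0]
print(end_point_error(pred, gt))  # 2.375
print(d1_error(pred, gt))         # 25.0
```

In this example, the 4 px error at ground-truth disparity 100 is not an outlier because it stays under the 5% relative threshold, while the same 4 px error at disparity 20 is.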

This review was generated by AI and reviewed by human editors.