[Paper Review] AMNet: Deep Atrous Multiscale Stereo Disparity Estimation Networks
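AMNet combines an atrous multiscale network with a depthwise separable ResNet backbone and an extended cost volume to achieve state-of-the-art stereo disparity estimation on KITTI, SceneFlow, and Middlebury. A foreground-background aware variant (FBA-AMNet) can also be trained with multitask learning.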
In this paper, a new deep learning architecture for stereo disparity estimation is proposed. The proposed atrous multiscale network (AMNet) adopts an efficient feature extractor with depthwise-separable convolutions and an extended cost volume that deploys novel stereo matching costs on the deep features. A stacked atrous multiscale network is proposed to aggregate rich multiscale contextual information from the cost volume which allows for estimating the disparity with high accuracy at multiple scales. AMNet can be further modified to be a foreground-background aware network, FBA-AMNet, which is capable of discriminating between the foreground and the background objects in the scene at multiple scales. An iterative multitask learning method is proposed to train FBA-AMNet end-to-end. The proposed disparity estimation networks, AMNet and FBA-AMNet, show accurate disparity estimates and advance the state of the art on the challenging Middlebury, KITTI 2012, KITTI 2015, and Sceneflow stereo disparity estimation benchmarks.
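To make the idea of atrous (dilated) multiscale aggregation concrete, here is a minimal 1D NumPy sketch. It is an illustration only, not the paper's AM module: the helper names `atrous_conv1d`, `receptive_field`, and `multiscale_features`, the 3-tap kernel, and the dilation rates (1, 2, 4) are all assumptions chosen for the example. It shows that increasing the dilation rate widens the receptive field while the output keeps the input resolution.

```python
import numpy as np


def atrous_conv1d(x, kernel, rate):
    """'Same'-padded 1D atrous convolution (cross-correlation form).

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    assert k % 2 == 1, "this sketch only handles odd kernel sizes"
    pad = rate * (k - 1) // 2
    xp = np.pad(np.asarray(x, dtype=float), pad)
    out = np.zeros(len(x), dtype=float)
    for i, w in enumerate(kernel):
        # Tap i reads the input at an offset of i * rate samples.
        out += w * xp[i * rate: i * rate + len(x)]
    return out


def receptive_field(kernel_size, rate):
    """Span of input samples seen by one dilated kernel."""
    return (kernel_size - 1) * rate + 1


def multiscale_features(x, kernel, rates=(1, 2, 4)):
    """Stack responses at several dilation rates (hypothetical rates)."""
    return np.stack([atrous_conv1d(x, kernel, r) for r in rates])


# Toy input: a single impulse in the middle of a 17-sample signal.
x = np.zeros(17)
x[8] = 1.0
ms = multiscale_features(x, kernel=np.ones(3))
print(ms.shape)                  # one row per dilation rate, same length as x
print(np.flatnonzero(ms[2]))     # positions touched by the rate-4 kernel
```

Each row of `ms` has the same length as the input, so context from different scales can be combined without downsampling and upsampling.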
Research Motivation and Objectives
- Develop a deep learning architecture for accurate stereo disparity estimation.
- Enhance contextual information capture via atrous multiscale modules to improve multiscale disparity estimation.
- Improve disparity accuracy with an extended cost volume combining multiple matching costs.
- Explore foreground-background awareness as an auxiliary task to boost disparity quality.
- Demonstrate state-of-the-art performance on KITTI, Sceneflow, and Middlebury benchmarks.
Proposed Method
- Use a depthwise separable ResNet (D-ResNet) as an efficient feature extractor with increased learning capacity.
- Introduce an Atrous Multiscale (AM) module to aggregate multiscale contextual information without losing resolution.
- Construct an Extended Cost Volume (ECV) that combines disparity-level feature concatenation, disparity-level feature distance, and disparity-level depthwise correlation.
- Process the cost volume with a stacked AM (SAM) to progressively refine context aggregation.
- Regress disparity with soft argmin from the outputs of the AM modules; in FBA-AMNet, train with a multitask loss that includes foreground-background segmentation.
- Optionally train FBA-AMNet end-to-end with an iterative multitask learning scheme, so that foreground-background segmentation informs disparity estimation.
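The cost-volume and regression steps listed above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the depthwise correlation is reduced to a per-channel elementwise product (a 1x1 patch), the channel layout is arbitrary, and the learned stacked AM network is replaced by a plain sum over the distance channels. The review does not give the paper's exact formulation.

```python
import numpy as np


def extended_cost_volume(left, right, max_disp):
    """Build a simplified extended cost volume from (C, H, W) feature maps.

    Returns an array of shape (4C, max_disp, H, W). Positions with no valid
    match (x < d) are left at zero.
    """
    C, H, W = left.shape
    vol = np.zeros((4 * C, max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        l = left[:, :, d:]        # left pixels with x >= d
        r = right[:, :, :W - d]   # candidate matches at x - d
        vol[0 * C:1 * C, d, :, d:] = l               # concatenation, left part
        vol[1 * C:2 * C, d, :, d:] = r               # concatenation, right part
        vol[2 * C:3 * C, d, :, d:] = np.abs(l - r)   # feature distance
        vol[3 * C:4 * C, d, :, d:] = l * r           # per-channel correlation
    return vol


def soft_argmin(cost):
    """Expected disparity under softmax(-cost); cost has shape (D, H, W)."""
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    disparities = np.arange(cost.shape[0], dtype=cost.dtype).reshape(-1, 1, 1)
    return (prob * disparities).sum(axis=0)


# Toy check: the right features are the left features shifted by 3 pixels.
rng = np.random.default_rng(0)
C, H, W, max_disp, true_disp = 8, 3, 16, 8, 3
left = rng.random((C, H, W))
right = np.roll(left, -true_disp, axis=2)

ecv = extended_cost_volume(left, right, max_disp)
# Stand-in for the learned aggregation network: sum the distance channels
# and scale them so the softmax is sharply peaked.
cost = 50.0 * ecv[2 * C:3 * C].sum(axis=0)
disparity = soft_argmin(cost)
# Only columns x >= max_disp - 1 have a valid match at every disparity level.
print(ecv.shape, disparity[:, max_disp - 1:].round(2))
```

Soft argmin returns a weighted average over disparity levels rather than a hard index, which keeps the regression differentiable and allows sub-pixel estimates.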
Experimental Results
Research Questions
- RQ1: Can atrous multiscale context aggregation improve stereo disparity estimation over conventional encoder-decoder architectures?
- RQ2: Does an extended cost volume with multiple matching metrics enhance disparity accuracy?
- RQ3: Does foreground-background awareness via multitask learning further improve disparity estimates, particularly at object boundaries?
- RQ4: What are the performance gains on standard benchmarks (KITTI 2015/2012, SceneFlow, Middlebury) when using AMNet and FBA-AMNet?
Key Findings
| Method | D1-bg (All) | D1-fg (All) | D1-all (All) | D1-bg (Non-Occluded) | D1-fg (Non-Occluded) | D1-all (Non-Occluded) | Runtime |
|---|---|---|---|---|---|---|---|
| GC-Net | 2.21% | 6.16% | 2.87% | 2.02% | 5.58% | 2.61% | 0.9 s |
| PDSNet | 2.29% | 4.05% | 2.58% | 2.09% | 3.68% | 2.36% | 0.5 s |
| PSMNet | 1.86% | 4.62% | 2.32% | 1.71% | 4.31% | 2.14% | 0.41 s |
| SegStereo | 1.88% | 4.07% | 2.25% | 1.72% | 3.41% | 2.00% | 0.7 s |
| MC-CSPN | 1.56% | 3.78% | 1.93% | 2.12% | 3.85% | 2.40% | 0.9 s |
| AMNet-8 | 1.64% | 3.96% | 2.03% | 1.50% | 3.75% | 1.87% | 0.7 s |
| AMNet-32 | 1.60% | 3.81% | 1.97% | 1.43% | 3.48% | 1.77% | 0.9 s |
| FBA-AMNet-8 | 1.60% | 3.88% | 1.98% | 1.45% | 3.74% | 1.82% | 0.7 s |
| FBA-AMNet-32 | 1.53% | 3.43% | 1.84% | 1.39% | 3.20% | 1.69% | 0.9 s |
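For readers unfamiliar with the D1 columns in the table above, the sketch below computes the metric on toy data. It assumes the usual KITTI 2015 convention, in which a pixel counts as an outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity, and in which pixels with ground truth 0 are treated as invalid. The function name `d1_error`, the masks, and the numbers are made up for illustration.

```python
import numpy as np


def d1_error(pred, gt, mask=None):
    """Percentage of outlier pixels among valid pixels (optionally masked)."""
    valid = gt > 0                      # assumption: 0 marks missing ground truth
    if mask is not None:
        valid = valid & mask
    err = np.abs(pred - gt)
    outlier = (err > 3.0) & (err > 0.05 * gt)
    return 100.0 * outlier[valid].mean()


# Toy disparities (pixels); the last pixel has no ground truth.
gt = np.array([10.0, 10.0, 100.0, 100.0, 0.0])
pred = np.array([10.5, 14.0, 104.0, 106.0, 50.0])
fg = np.array([True, True, True, False, False])   # hypothetical foreground mask

d1_all = d1_error(pred, gt)
d1_fg = d1_error(pred, gt, fg)
d1_bg = d1_error(pred, gt, ~fg)
print(d1_all, d1_fg, d1_bg)
```

In this toy case the third pixel is off by 4 px, yet it is not an outlier because 4 px is below 5% of its ground-truth disparity of 100 px.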
- AMNet and FBA-AMNet achieve state-of-the-art disparity accuracy on KITTI 2015, KITTI 2012, and Sceneflow benchmarks.
- On the KITTI 2015 test set, FBA-AMNet-32 has the lowest D1-all in the table above: 1.84% on all pixels and 1.69% on non-occluded pixels.
- AMNet-32 reaches 1.97% D1-all on all pixels, close to MC-CSPN (1.93%), and 1.77% on non-occluded pixels, where it is ahead of every prior method listed.
- AMNet-32 attains an end-point error (EPE) of 0.74 on Sceneflow, surpassing the previous best by 32.1%.
- Foreground-background awareness via multitask learning improves disparity estimation without requiring separate semantic segmentation during inference.
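As a reading aid for the Sceneflow result, the sketch below shows how an end-point error and a relative error reduction are typically computed. The helper names `epe` and `relative_reduction` are hypothetical, treating ground truth 0 as invalid is an assumption, and the inputs are toy values, not numbers from the paper.

```python
import numpy as np


def epe(pred, gt):
    """Mean absolute disparity error over pixels with valid ground truth."""
    valid = gt > 0                      # assumption: 0 marks missing ground truth
    return float(np.abs(pred - gt)[valid].mean())


def relative_reduction(baseline, new):
    """Percentage by which `new` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Toy data only; the last pixel has no ground truth and is ignored.
gt = np.array([1.0, 2.0, 5.0, 0.0])
pred = np.array([1.0, 2.5, 4.0, 9.0])

toy_epe = epe(pred, gt)
toy_gain = relative_reduction(1.0, 0.75)
print(toy_epe, toy_gain)
```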
This review was generated by AI and checked by a human editor.