[Paper Review] PVSNet: Pixelwise Visibility-Aware Multi-View Stereo Network
PVSNet learns pixelwise visibility for neighboring views to weight their contribution in multi-view stereo, with an anti-noise training strategy, achieving state-of-the-art results across multiple datasets including ETH3D high-res.
Recently, learning-based multi-view stereo methods have achieved promising results. However, they all overlook the visibility difference among different views, which leads to an indiscriminate multi-view similarity definition and greatly limits their performance on datasets with strong viewpoint variations. In this paper, a Pixelwise Visibility-aware multi-view Stereo Network (PVSNet) is proposed for robust dense 3D reconstruction. We present a pixelwise visibility network to learn the visibility information for different neighboring images before computing the multi-view similarity, and then construct an adaptive weighted cost volume with the visibility information. Moreover, we present an anti-noise training strategy that introduces disturbing views during model training so that the pixelwise visibility network can better distinguish unrelated views, which differs from existing learning-based methods that use only the two best neighboring views for training. To the best of our knowledge, PVSNet is the first deep learning framework that is able to capture the visibility information of different neighboring views. In this way, our method generalizes well to different types of datasets, especially the ETH3D high-res benchmark with strong viewpoint variations. Extensive experiments show that PVSNet achieves state-of-the-art performance on different datasets.
Motivation & Objective
- Motivate robust dense 3D reconstruction under strong viewpoint variations by modeling per-pixel visibility across views.
- Introduce a pixelwise visibility network to learn visibility maps for neighboring images relative to a reference view.
- Aggregate two-view cost volumes using learned visibility weights to form a robust unified cost volume.
- Propose an anti-noise training strategy that exposes disturbing views to improve robustness.
- Demonstrate state-of-the-art performance on multiple MVS benchmarks, including ETH3D high-res.
Proposed method
- Construct a two-view cost volume for each neighboring image via plane-sweep using multiple depth hypotheses.
- Regress a 2D pixelwise visibility map from each two-view cost volume using a 3D U-Net to capture occlusion and viewing-geometry effects.
- Aggregate all two-view costs into a single weighted cost volume using visibility maps as weights: C_agg = (Σ_i V_i' · C_ref,i) / (Σ_i V_i').
- Perform cost volume filtering and inverse-depth regression with a 3D CNN-based pipeline to obtain depth maps.
- Extend to high-resolution estimation by refining depth iteratively using previous stage visibility to build thin, high-res cost volumes.
- Introduce an anti-noise training strategy by including the worst two views during training to improve discrimination of unrelated views.
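The aggregation, regression, and view-selection steps listed above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, array layouts, the small epsilon in the denominator, the softmax over negated costs (which assumes lower cost means a better match), and the per-view `scores` used to rank neighbors are all assumptions made for illustration.

```python
import numpy as np

def aggregate_cost_volumes(costs, vis, eps=1e-8):
    """Visibility-weighted average of two-view cost volumes.

    costs: (N, D, H, W) one cost volume per neighboring view (assumed layout)
    vis:   (N, H, W)    one visibility map per neighboring view, values in [0, 1]
    Returns an aggregated (D, H, W) cost volume.
    """
    costs = np.asarray(costs, dtype=np.float64)
    vis = np.asarray(vis, dtype=np.float64)
    w = vis[:, None, :, :]            # broadcast each map over the depth axis
    num = (w * costs).sum(axis=0)     # sum_i V_i * C_i
    den = w.sum(axis=0) + eps         # sum_i V_i (eps avoids division by zero)
    return num / den

def regress_inverse_depth(cost, d_min, d_max):
    """Soft-argmin over hypotheses sampled uniformly in inverse depth.

    cost: (D, H, W) filtered cost volume; lower is assumed to be better.
    Returns an (H, W) depth map.
    """
    n_hyp = cost.shape[0]
    inv_depths = np.linspace(1.0 / d_min, 1.0 / d_max, n_hyp)
    logits = -np.asarray(cost, dtype=np.float64)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    expected_inv = (prob * inv_depths[:, None, None]).sum(axis=0)
    return 1.0 / expected_inv

def select_training_views(scores, n_best=2, n_worst=2):
    """Pick the best- and worst-ranked neighbors (hypothetical ranking scores).

    Mirrors the idea of adding disturbing views to the usual best views;
    assumes len(scores) >= n_best + n_worst so the two groups do not overlap.
    """
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return [int(i) for i in order[:n_best]] + [int(i) for i in order[-n_worst:]]
```

In this sketch, a neighbor whose visibility is zero at a pixel contributes nothing to the aggregated cost there, which is the effect the visibility weighting is meant to have on occluded or unrelated views.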
Experimental results
Research questions
- RQ1: Can pixelwise visibility information across neighboring views be learned and leveraged to improve MVS depth estimation?
- RQ2: Does explicitly modeling visibility lead to more robust depth aggregation in datasets with strong viewpoint changes (e.g., ETH3D high-res)?
- RQ3: Does an anti-noise training strategy reduce sensitivity to non-credible views and improve performance as more views are added?
Key findings
- PVSNet learns pixelwise visibility maps for neighboring views and uses them as weights when aggregating the two-view cost volumes, reducing the influence of noise from unrelated views.
- The anti-noise training strategy (AN) that includes disturbing views significantly improves robustness and performance as the number of input views increases.
- On the DTU dataset, the high-resolution version of PVSNet achieves state-of-the-art completeness and competitive accuracy and overall scores among learning-based methods.
- PVSNet with visibility estimation improves results on Tanks and Temples, including the Advanced dataset with stronger viewpoint variation.
- On ETH3D high-res benchmark, PVSNet is the first learning-based method evaluated and achieves competitive accuracy and completeness, comparable to Colmap with low-resolution input.
- Overall, PVSNet demonstrates strong generalization across indoor/outdoor scenes and datasets with varying viewpoint changes.
This review was created by AI and reviewed by human editors.