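[Paper Explained] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Depth Pro is a zero-shot metric monocular depth model that outputs sharp, high-resolution depth maps with absolute scale. It runs inference in 0.3 seconds on a V100 GPU and relies on a multi-scale ViT-based architecture together with a training curriculum that combines real and synthetic data.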
We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro
Research Motivation and Goals
- Develop a zero-shot, metric monocular depth estimator that outputs absolute-scale depth without camera intrinsics.
- Achieve high-resolution, boundary-accurate depth maps with fine structures (hair, fur, vegetation).
- Maintain low latency to enable interactive view synthesis and related applications.
- Estimate focal length from a single image to provide robust metric depth without EXIF data.
- Introduce evaluation metrics for depth-boundary fidelity using matting/segmentation datasets.
Proposed Method
- Apply a plain ViT-based architecture operating at a fixed high resolution (1536x1536) by processing patches across multiple scales and fusing them with a DPT-style decoder.
- Predict a canonical inverse depth map C from an input image I, then compute metric depth via D_m = f_px / (w C), where f_px is the focal length in pixels and w is the image width in pixels.
- Train with a two-stage curriculum mixing real and synthetic datasets to balance boundary sharpness and pixelwise accuracy (Stage 1: robust cross-domain features; Stage 2: sharpen boundaries using high-quality synthetic ground truth).
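- Introduce multi-scale derivative losses, namely mean absolute gradient error (MAGE), mean absolute Laplace error (MALE), and mean squared gradient error (MSGE), to enforce sharp boundaries and fine details across scales.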
- Propose zero-shot focal length estimation from intermediate features plus a dedicated focal-length head trained separately to predict the horizontal field of view.
- Develop boundary-focused evaluation metrics leveraging matting/segmentation annotations to quantify occluding contours and boundary recall.
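The conversion from canonical inverse depth to metric depth described in the list above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names, the clamping threshold, and the constant input map are hypothetical. The only relations used are D_m = f_px / (w C) from the text and the standard pinhole identity f_px = 0.5 w / tan(0.5 FOV), which links a predicted horizontal field of view to a focal length in pixels.

```python
import math

import numpy as np


def focal_px_from_hfov(hfov_deg: float, width_px: int) -> float:
    """Pinhole relation between horizontal field of view and focal length in pixels."""
    return 0.5 * width_px / math.tan(math.radians(hfov_deg) / 2.0)


def metric_depth(canonical_inv_depth: np.ndarray, f_px: float) -> np.ndarray:
    """Apply D_m = f_px / (w * C), taking w from the last axis of the array."""
    width_px = canonical_inv_depth.shape[-1]
    # Clamp C away from zero so far-away pixels do not divide by zero.
    # The threshold is an illustrative assumption, not a value from the paper.
    c = np.clip(canonical_inv_depth, 1e-4, None)
    return f_px / (width_px * c)


# Hypothetical inputs: a constant canonical inverse depth map at the
# network resolution and a 90-degree horizontal field of view.
C = np.full((1536, 1536), 0.25)
f_px = focal_px_from_hfov(90.0, C.shape[1])  # about 768 px, since tan(45 deg) = 1
D_m = metric_depth(C, f_px)                  # about 768 / (1536 * 0.25) = 2.0 everywhere
```

Because the focal length enters only as a scalar multiplier, a field of view read from EXIF metadata could in principle be substituted for the predicted one without touching the depth network.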
Experimental Results
Research Questions
- RQ1: Can a zero-shot monocular depth model produce metric, absolute-scale depth without camera intrinsics?
- RQ2: Does a multi-scale ViT-based architecture yield sharper depth boundaries at high resolution while maintaining fast runtimes?
- RQ3: How can training with a mix of real and synthetic data, plus specialized boundary-focused losses, improve boundary fidelity in depth maps?
- RQ4: Is focal length estimable from a single image in a zero-shot setting with high accuracy?
- RQ5: Do new boundary-aware evaluation metrics correlate with practical improvements in view synthesis and editing tasks?
Key Findings
Zero-shot metric depth accuracy, reported as the δ1 score (%) per dataset (higher is better). Avg. Rank aggregates across datasets (lower is better). OOM marks runs that ran out of memory.

| Method | Booster | ETH3D | Middlebury | NuScenes | Sintel | Sun-RGBD | Avg. Rank |
|---|---|---|---|---|---|---|---|
| DepthAnything (Yang et al., 2024a) | 52.3 | 9.3 | 39.3 | 35.4 | 6.9 | 85.0 | 5.7 |
| DepthAnything v2 (Yang et al., 2024b) | 59.5 | 36.3 | 37.2 | 17.7 | 5.9 | 72.4 | 5.8 |
| Metric3D (Yin et al., 2023) | 4.7 | 34.2 | 13.6 | 64.4 | 17.3 | 16.9 | 5.8 |
| Metric3D v2 (Hu et al., 2024) | 39.4 | 87.7 | 29.9 | 82.6 | 38.3 | 75.6 | 3.7 |
| PatchFusion (Li et al., 2024a) | 22.6 | 51.8 | 49.9 | 20.4 | 14.0 | 53.6 | 5.2 |
| UniDepth (Piccinelli et al., 2024) | 27.6 | 25.3 | 31.9 | 83.6 | 16.5 | 95.8 | 4.2 |
| ZeroDepth (Guizilini et al., 2023) | OOM | OOM | 46.5 | 64.3 | 12.9 | OOM | 4.6 |
| ZoeDepth (Bhat et al., 2023) | 21.6 | 34.2 | 53.8 | 28.1 | 7.8 | 85.7 | 5.3 |
| Depth Pro (Ours) | 46.6 | 41.5 | 60.5 | 49.1 | 40.0 | 89.0 | 2.5 |
- Depth Pro achieves 2.25-megapixel depth maps at 0.3s on a V100 GPU with absolute metric depth and no camera intrinsics.
- Depth Pro yields superior boundary accuracy, outperforming prior work by a multiplicative margin in boundary recall across multiple datasets.
- On zero-shot metric depth, Depth Pro ranks best on average across Booster, ETH3D, Middlebury, NuScenes, Sintel, and Sun-RGBD datasets.
- Depth Pro is considerably faster and sharper in boundaries than diffusion-based Marigold and patch-based PatchFusion baselines.
- Focal length estimation from a single image with Depth Pro significantly outperforms prior focal-length predictors on a curated zero-shot dataset.
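The claim that Depth Pro ranks best on average can be checked against the table with a short script. This is a sketch under stated assumptions: higher scores are better, OOM entries are skipped when ranking a dataset, and tied scores share the better rank. The summary does not state how the paper handles ties or OOM entries, so these conventions are guesses. With them, the script gives 2.5 for Depth Pro and about 3.7 for Metric3D v2, matching the table, while a few other rows differ from the reported column by a few tenths.

```python
from typing import Dict, List, Optional

DATASETS = ["Booster", "ETH3D", "Middlebury", "NuScenes", "Sintel", "Sun-RGBD"]

# Scores copied from the table above; None marks an OOM entry.
SCORES: Dict[str, List[Optional[float]]] = {
    "DepthAnything":    [52.3, 9.3, 39.3, 35.4, 6.9, 85.0],
    "DepthAnything v2": [59.5, 36.3, 37.2, 17.7, 5.9, 72.4],
    "Metric3D":         [4.7, 34.2, 13.6, 64.4, 17.3, 16.9],
    "Metric3D v2":      [39.4, 87.7, 29.9, 82.6, 38.3, 75.6],
    "PatchFusion":      [22.6, 51.8, 49.9, 20.4, 14.0, 53.6],
    "UniDepth":         [27.6, 25.3, 31.9, 83.6, 16.5, 95.8],
    "ZeroDepth":        [None, None, 46.5, 64.3, 12.9, None],
    "ZoeDepth":         [21.6, 34.2, 53.8, 28.1, 7.8, 85.7],
    "Depth Pro (Ours)": [46.6, 41.5, 60.5, 49.1, 40.0, 89.0],
}


def average_ranks(scores: Dict[str, List[Optional[float]]]) -> Dict[str, float]:
    """Rank methods per dataset (1 = highest score) and average the ranks."""
    ranks: Dict[str, List[int]] = {method: [] for method in scores}
    for col in range(len(DATASETS)):
        # Assumption: methods without a score on this dataset are left out.
        present = {m: v[col] for m, v in scores.items() if v[col] is not None}
        for method, score in present.items():
            # Assumption: tied scores share the better (smaller) rank.
            rank = 1 + sum(other > score for other in present.values())
            ranks[method].append(rank)
    return {m: sum(r) / len(r) for m, r in ranks.items()}


avg = average_ranks(SCORES)
best = min(avg, key=avg.get)
for method, value in sorted(avg.items(), key=lambda item: item[1]):
    print(f"{method:18s} {value:.2f}")
```

Depth Pro's per-dataset ranks under these assumptions are 3, 3, 1, 5, 1, and 2, which average to 2.5. The method is not first on every benchmark, but it is never far from the top.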
This explainer was generated by AI and reviewed by a human editor.