QUICK REVIEW

[Paper Review] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Aleksei Bochkovskii, Amaël Delaunoy|arXiv (Cornell University)|Oct 2, 2024

Advanced Measurement and Metrology Techniques | Cited by 14

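One-Sentence Summary

Depth Pro is a zero-shot metric monocular depth model that outputs sharp, high-resolution depth maps with absolute scale, completes inference in 0.3 seconds on a V100 GPU, and combines a multi-scale ViT-based architecture with a training curriculum that mixes real and synthetic data.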

ABSTRACT

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro

Motivation and Objectives

Develop a zero-shot, metric monocular depth estimator that outputs absolute-scale depth without camera intrinsics.
Achieve high-resolution, boundary-accurate depth maps with fine structures (hair, fur, vegetation).
Maintain low latency to enable interactive view synthesis and related applications.
Estimate focal length from a single image to provide robust metric depth without EXIF data.
Introduce evaluation metrics for depth-boundary fidelity using matting/segmentation datasets.

Proposed Method

Apply plain ViT encoders to patches extracted at multiple scales from a fixed-resolution 1536×1536 input, and fuse the resulting features with a DPT-style decoder.
Predict a canonical inverse depth map C from the input image I, then convert it to metric depth via D_m = f_px / (w · C), where f_px is the focal length in pixels and w is the image width in pixels.
Train with a two-stage curriculum mixing real and synthetic datasets to balance boundary sharpness and pixelwise accuracy (Stage 1: robust cross-domain features; Stage 2: sharpen boundaries using high-quality synthetic ground truth).
Introduce multi-scale derivative losses (mean absolute gradient error, MAGE; mean absolute Laplace error, MALE; mean squared gradient error, MSGE) to enforce sharp boundaries and fine details across scales.
Estimate focal length zero-shot with a dedicated head that consumes intermediate features of the depth network and is trained separately to predict the horizontal field of view.
Develop boundary-focused evaluation metrics leveraging matting/segmentation annotations to quantify occluding contours and boundary recall.

Experimental Results

Research Questions

RQ1: Can a zero-shot monocular depth model produce metric, absolute-scale depth without camera intrinsics?
RQ2: Does a multi-scale ViT-based architecture yield sharper depth boundaries at high resolution while maintaining fast runtimes?
RQ3: How can training with a mix of real and synthetic data, plus specialized boundary-focused losses, improve boundary fidelity in depth maps?
RQ4: Is focal length estimable from a single image in a zero-shot setting with high accuracy?
RQ5: Do new boundary-aware evaluation metrics correlate with practical improvements in view synthesis and editing tasks?

Key Findings

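Zero-shot metric depth accuracy: δ1 score per dataset in % (higher is better) and average rank across datasets (lower is better). OOM = out of memory.
Method	Booster	ETH3D	Middlebury	NuScenes	Sintel	Sun-RGBD	Avg. Rank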
DepthAnything (Yang et al., 2024a)	52.3	9.3	39.3	35.4	6.9	85.0	5.7
DepthAnything v2 (Yang et al., 2024b)	59.5	36.3	37.2	17.7	5.9	72.4	5.8
Metric3D (Yin et al., 2023)	4.7	34.2	13.6	64.4	17.3	16.9	5.8
Metric3D v2 (Hu et al., 2024)	39.4	87.7	29.9	82.6	38.3	75.6	3.7
PatchFusion (Li et al., 2024a)	22.6	51.8	49.9	20.4	14.0	53.6	5.2
UniDepth (Piccinelli et al., 2024)	27.6	25.3	31.9	83.6	16.5	95.8	4.2
ZeroDepth (Guizilini et al., 2023)	OOM	OOM	46.5	64.3	12.9	OOM	4.6
ZoeDepth (Bhat et al., 2023)	21.6	34.2	53.8	28.1	7.8	85.7	5.3
Depth Pro (Ours)	46.6	41.5	60.5	49.1	40.0	89.0	2.5
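
The Avg. Rank column can be approximated from the per-dataset scores. The sketch below treats the table entries as higher-is-better scores and uses one plausible scheme: rank methods per dataset, skip OOM entries, give tied scores the same rank, and average each method's ranks. These choices are assumptions, since this summary does not state the ranking protocol. The scheme reproduces the 2.5 reported for Depth Pro but not every row (for example, it gives 5.5 rather than 5.7 for DepthAnything), so the paper's procedure evidently differs in details not given here.

```python
# Scores copied from the table above; None marks OOM entries.
DATASETS = ["Booster", "ETH3D", "Middlebury", "NuScenes", "Sintel", "Sun-RGBD"]
SCORES = {
    "DepthAnything":    [52.3, 9.3, 39.3, 35.4, 6.9, 85.0],
    "DepthAnything v2": [59.5, 36.3, 37.2, 17.7, 5.9, 72.4],
    "Metric3D":         [4.7, 34.2, 13.6, 64.4, 17.3, 16.9],
    "Metric3D v2":      [39.4, 87.7, 29.9, 82.6, 38.3, 75.6],
    "PatchFusion":      [22.6, 51.8, 49.9, 20.4, 14.0, 53.6],
    "UniDepth":         [27.6, 25.3, 31.9, 83.6, 16.5, 95.8],
    "ZeroDepth":        [None, None, 46.5, 64.3, 12.9, None],
    "ZoeDepth":         [21.6, 34.2, 53.8, 28.1, 7.8, 85.7],
    "Depth Pro":        [46.6, 41.5, 60.5, 49.1, 40.0, 89.0],
}


def average_ranks(scores):
    """Rank methods per dataset (1 = highest score), then average the ranks."""
    ranks = {method: [] for method in scores}
    for d in range(len(DATASETS)):
        # Methods without a score on this dataset (OOM) are left out.
        valid = {m: v[d] for m, v in scores.items() if v[d] is not None}
        for method, score in valid.items():
            # Competition ranking: 1 + number of strictly better scores,
            # so tied methods share the same rank.
            better = sum(other > score for other in valid.values())
            ranks[method].append(1 + better)
    return {m: sum(r) / len(r) for m, r in ranks.items()}


avg = average_ranks(SCORES)
best = min(avg, key=avg.get)
print(best, round(avg[best], 1))  # Depth Pro 2.5
```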

Depth Pro produces 2.25-megapixel depth maps in 0.3 s on a V100 GPU, with absolute metric scale and without camera intrinsics.
Depth Pro yields superior boundary accuracy, outperforming prior work in boundary recall by a multiplicative factor across multiple datasets.
On zero-shot metric depth, Depth Pro achieves the best average rank (2.5) across the Booster, ETH3D, Middlebury, NuScenes, Sintel, and Sun-RGBD datasets.
Depth Pro is considerably faster, and produces sharper boundaries, than the diffusion-based Marigold and the patch-based PatchFusion baselines.
Focal length estimation from a single image with Depth Pro significantly outperforms prior focal-length predictors on a curated zero-shot dataset.

This review was generated by AI and reviewed by human editors.