QUICK REVIEW

[Paper Review] Depth Anything V2

Lihe Yang, Bingyi Kang|arXiv (Cornell University)|Jun 13, 2024

Advanced Vision and Imaging | Cited by 9

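One-Sentence Summary

Depth Anything V2 builds a robust monocular depth estimation model by first training on precise synthetic labels, then leveraging large-scale pseudo-labeled real images through a teacher-student framework. The result is fine-grained, robust depth prediction, accompanied by a versatile evaluation benchmark (DA-2K).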

ABSTRACT

This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.

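Motivation and Objectives

Replace labeled real images with synthetic images, whose depth labels are precise, to improve accuracy and fine detail.
Scale up the teacher model and use its predictions to guide student models.
Use large-scale pseudo-labeled real images to bridge the gap between the synthetic and real domains and improve generalization.
Offer models at multiple scales (25M to 1.3B parameters) and support fine-tuning for downstream tasks.
Introduce DA-2K, a versatile, high-resolution evaluation benchmark for depth estimation.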
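
The teacher-to-student handoff described in these objectives can be illustrated with a deliberately tiny sketch. Everything in it is a stand-in chosen for illustration: `teacher` is a fixed function playing the role of a large model already trained on synthetic labels, `fit_student` is an ordinary least-squares line fit playing the role of student training, and the inputs are scalars rather than images. It only shows the data flow, in which the student never sees a human-provided label.

```python
def teacher(x):
    # Hypothetical stand-in for a large teacher trained on synthetic labels.
    return 3.0 * x + 1.0


def pseudo_label(unlabeled, model):
    """Pair every unlabeled input with the model's prediction for it."""
    return [(x, model(x)) for x in unlabeled]


def fit_student(pairs):
    """Least-squares fit of y = a*x + b, standing in for student training."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov / var_x
    b = mean_y - a * mean_x
    return lambda x: a * x + b


# Unlabeled "real" inputs: no ground truth is available for these.
real_inputs = [0.0, 1.0, 2.0, 3.0, 4.0]

# The student is trained only on (input, teacher prediction) pairs.
student = fit_student(pseudo_label(real_inputs, teacher))
```

In this toy setting the student reproduces the teacher exactly because both are linear. In the real setting the student is a smaller network and the pseudo-labeled set is a large collection of real images.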

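Proposed Method

Train a high-capacity teacher model on precise synthetic depth data.
Use the teacher to annotate large-scale unlabeled real images with pseudo depth labels.
Train student models only on the pseudo-labeled real images to achieve zero-shot generalization.
Fine-tune the base models with metric depth labels to obtain metric depth models.
Supervise with an affine-invariant inverse depth representation and two losses: a scale- and shift-invariant loss and a gradient matching loss.
Add a feature alignment loss on the pseudo-labeled data to preserve semantics from the pre-trained encoder.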
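
To make the two supervision terms more concrete, here is a minimal sketch of a scale- and shift-invariant loss and a gradient matching loss on flat lists of inverse-depth values. This summary does not give the paper's exact formulas, so the sketch assumes the median / mean-absolute-deviation normalization popularized by MiDaS and a single-scale gradient term. The function names `normalize_affine`, `ssi_loss`, and `gradient_matching_loss` are hypothetical.

```python
from statistics import median


def normalize_affine(values):
    """Shift by the median and scale by the mean absolute deviation."""
    t = median(values)
    s = sum(abs(v - t) for v in values) / len(values)
    if s == 0:
        s = 1.0  # assumption: leave a constant map unscaled to avoid dividing by zero
    return [(v - t) / s for v in values]


def ssi_loss(pred, target):
    """Mean absolute difference after normalizing both maps."""
    p = normalize_affine(pred)
    g = normalize_affine(target)
    return sum(abs(a - b) for a, b in zip(p, g)) / len(p)


def gradient_matching_loss(pred, target, width):
    """Mean absolute horizontal and vertical gradient of the residual.

    Both maps are flat, row-major lists for an image with `width` columns.
    """
    p = normalize_affine(pred)
    g = normalize_affine(target)
    r = [a - b for a, b in zip(p, g)]
    height = len(r) // width
    total = 0.0
    for y in range(height):
        for x in range(width):
            i = y * width + x
            if x + 1 < width:
                total += abs(r[i + 1] - r[i])      # horizontal neighbor
            if y + 1 < height:
                total += abs(r[i + width] - r[i])  # vertical neighbor
    return total / len(r)
```

Because both maps are normalized before comparison, a prediction that differs from the target only by a positive scale and a shift incurs zero loss under both terms. That is the property an affine-invariant representation is meant to provide.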

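Experimental Results

Research Questions

RQ1: Can an efficient discriminative model achieve fine-grained depth detail without resorting to large-scale diffusion modeling?
RQ2: What are the limitations of using synthetic data for monocular depth estimation, and how can they be mitigated?
RQ3: How can unlabeled real images be used to bridge the synthetic-to-real gap and improve the generalization of smaller models?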

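Key Findings

Replacing all labeled real images with synthetic images yields precise depth labels and detailed supervision.
A high-capacity teacher trained on synthetic data, combined with pseudo-labeling of real images, significantly improves robustness and fine-grained depth prediction.
Depth Anything V2 offers models at multiple scales (25M to 1.3B parameters) and runs inference more than 10x faster than recent Stable Diffusion (SD)-based models.
Using pseudo-labeled real images as training data improves zero-shot performance and broadens scene coverage.
A new evaluation benchmark, DA-2K, provides diverse high-resolution scenes and precise sparse depth labels to better reflect real-world monocular depth estimation (MDE) performance.
Pseudo-labels on real data outperform manually labeled real data on transfer tasks (e.g., KITTI, NYU-D).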
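
As a rough illustration of how a benchmark with sparse depth labels can be scored, the sketch below computes pairwise ordering accuracy. It assumes each annotation is a pair of pixels plus a flag saying which one is closer to the camera, and that the model outputs inverse depth (larger means closer). The annotation layout and the function name `pair_accuracy` are hypothetical and are not taken from the benchmark's released tooling.

```python
def pair_accuracy(inv_depth, pairs):
    """Fraction of annotated pairs whose closer point is predicted closer.

    inv_depth: 2D list of predicted inverse depth (larger = closer).
    pairs: list of ((row_a, col_a), (row_b, col_b), closer),
           where closer is "a" or "b".
    """
    correct = 0
    for (ra, ca), (rb, cb), closer in pairs:
        predicted = "a" if inv_depth[ra][ca] > inv_depth[rb][cb] else "b"
        correct += predicted == closer
    return correct / len(pairs)
```

A metric of this kind depends only on the relative ordering of the two points, so it can be applied to affine-invariant predictions without first aligning scale and shift to ground truth.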


This review was generated by AI and reviewed by human editors.