QUICK REVIEW

[Paper Review] AnyDepth: Depth Estimation Made Easy

Zeyu Ren, Zeyu Zhang|arXiv (Cornell University)|Jan 6, 2026

Advanced Vision and Imaging | Cited by 0

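One-Sentence Summary

AnyDepth introduces a lightweight, data-centric framework for zero-shot monocular depth estimation that replaces heavy multi-branch decoders with a single-path Simple Depth Transformer (SDT) and pairs it with a quality-aware data filtering strategy. Compared with DPT, it achieves competitive accuracy on multiple benchmarks with fewer parameters and lower training cost.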

ABSTRACT

Monocular depth estimation aims to recover the depth information of 3D scenes from 2D images. Recent work has made significant progress, but its reliance on large-scale datasets and complex decoders has limited its efficiency and generalization ability. In this paper, we propose a lightweight and data-centric framework for zero-shot monocular depth estimation. We first adopt DINOv3 as the visual encoder to obtain high-quality dense features. Secondly, to address the inherent drawbacks of the complex structure of the DPT, we design the Simple Depth Transformer (SDT), a compact transformer-based decoder. Compared to the DPT, it uses a single-path feature fusion and upsampling process to reduce the computational overhead of cross-scale feature fusion, achieving higher accuracy while reducing the number of parameters by approximately 85%-89%. Furthermore, we propose a quality-based filtering strategy to filter out harmful samples, thereby reducing dataset size while improving overall training quality. Extensive experiments on five benchmarks demonstrate that our framework surpasses the DPT in accuracy. This work highlights the importance of balancing model design and data quality for achieving efficient and generalizable zero-shot depth estimation. Code: https://github.com/AIGeeksGroup/AnyDepth. Website: https://aigeeksgroup.github.io/AnyDepth.

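Motivation and Objectives

Reduce model and data complexity in zero-shot monocular depth estimation.
Propose a lightweight decoder (SDT) to replace multi-branch cross-scale fusion.
Introduce a quality-based data filtering strategy to use training data more efficiently.
Show that AnyDepth reaches competitive accuracy with significantly fewer parameters and FLOPs.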

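Proposed Method

Use a frozen DINOv3 encoder to extract multi-scale tokens from four Transformer layers.
Introduce the Simple Depth Transformer (SDT): single-path fusion and one-shot reconstruction, with only a single linear projection for token fusion.
Fuse the multi-layer tokens with learnable layer-wise weights, then map them onto a single spatial feature map.
Apply a Spatial Detail Enhancer (SDE) to refine texture detail and local structure.
Upsample with a learnable dynamic sampler (DySample) along a progressive two-stage upsampling path.
Train with scale-and-shift-invariant (SSI) and gradient matching losses; apply data-centric filtering to remove low-quality samples.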
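
The layer-weighted token fusion and the SSI loss named above can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes that the learnable layer weights are normalized with a softmax, that tokens are laid out in row-major order, and that the SSI loss is a mean absolute error after a least-squares scale and shift alignment (one common formulation). The function names `fuse_layer_tokens`, `tokens_to_grid`, and `ssi_loss` are hypothetical, and the gradient matching term is not covered.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layer_tokens(layer_tokens, layer_logits):
    """Weighted sum of per-layer token sequences.

    layer_tokens: list of L layers, each a list of N tokens with C channels.
    layer_logits: L learnable scalars (assumption: softmax-normalized).
    """
    alphas = softmax(layer_logits)
    n_tokens = len(layer_tokens[0])
    channels = len(layer_tokens[0][0])
    fused = [[0.0] * channels for _ in range(n_tokens)]
    for alpha, tokens in zip(alphas, layer_tokens):
        for i in range(n_tokens):
            for c in range(channels):
                fused[i][c] += alpha * tokens[i][c]
    return fused


def tokens_to_grid(tokens, height, width):
    """Reshape N = height * width row-major tokens into a height x width grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def ssi_loss(pred, target):
    """Mean absolute error after least-squares scale/shift alignment of pred."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p if var_p > 0 else 0.0  # constant pred: no scale info
    shift = mean_t - scale * mean_p
    return sum(abs(scale * p + shift - t) for p, t in zip(pred, target)) / n


# Toy input: 4 layers, 4 tokens each, 2 channels; layer k holds [k, 2k].
layers = [[[float(k), 2.0 * k] for _ in range(4)] for k in range(4)]
fused = fuse_layer_tokens(layers, [0.0, 0.0, 0.0, 0.0])  # equal weights
grid = tokens_to_grid(fused, 2, 2)

# A prediction that differs from the target only by scale and shift.
loss = ssi_loss([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

With equal logits the fusion reduces to a plain average across layers, and a prediction that differs from the target only by an affine transform incurs zero loss, which is the invariance the SSI term is meant to provide. In a real model the per-token loops would be tensor operations and the logits would be trained jointly with the decoder.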

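Experimental Results

Research Questions

RQ1: Can the lightweight SDT decoder match DPT in zero-shot monocular depth estimation?
RQ2: Can data-centric filtering improve training quality and model performance while using less data?
RQ3: With high-resolution inputs, how much efficiency (parameters, FLOPs, latency) is gained by combining SDT with a DINOv3 backbone?
RQ4: Without large-scale supervised data, how does AnyDepth perform on indoor and outdoor zero-shot depth benchmarks?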

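Key Findings

Compared with DPT, SDT cuts the parameter count by approximately 85%-89% while achieving higher accuracy in zero-shot depth estimation.
The quality-based data filtering strategy reduces the amount of training data and improves overall model performance.
AnyDepth with SDT achieves accuracy on par with DPT in zero-shot settings on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, with lower FLOPs and comparable or even faster inference.
Progressive upsampling with DySample preserves high-frequency detail better than bilinear upsampling, sharpening edges and depth boundaries.
The efficiency analysis shows that, across model sizes and input resolutions, parameters and FLOPs drop significantly, while inference speed is roughly unchanged or improved.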


This review was generated by AI and reviewed by a human editor.