QUICK REVIEW

[Paper Review] AnyDepth: Depth Estimation Made Easy

Zeyu Ren, Zeyu Zhang | arXiv (Cornell University) | 2026. 01. 06.

Advanced Vision and Imaging | Citations: 0

One-Line Summary

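AnyDepth introduces a lightweight, data-centric framework for zero-shot monocular depth estimation. It replaces the heavy multi-branch decoder with a simple single-path decoder, the Simple Depth Transformer (SDT), and adds a quality-aware data filtering strategy. Across multiple benchmarks it reaches competitive accuracy with far fewer parameters and lower training cost than DPT.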

ABSTRACT

Monocular depth estimation aims to recover the depth information of 3D scenes from 2D images. Recent work has made significant progress, but its reliance on large-scale datasets and complex decoders has limited its efficiency and generalization ability. In this paper, we propose a lightweight and data-centric framework for zero-shot monocular depth estimation. We first adopt DINOv3 as the visual encoder to obtain high-quality dense features. Secondly, to address the inherent drawbacks of the complex structure of the DPT, we design the Simple Depth Transformer (SDT), a compact transformer-based decoder. Compared to the DPT, it uses a single-path feature fusion and upsampling process to reduce the computational overhead of cross-scale feature fusion, achieving higher accuracy while reducing the number of parameters by approximately 85%-89%. Furthermore, we propose a quality-based filtering strategy to filter out harmful samples, thereby reducing dataset size while improving overall training quality. Extensive experiments on five benchmarks demonstrate that our framework surpasses the DPT in accuracy. This work highlights the importance of balancing model design and data quality for achieving efficient and generalizable zero-shot depth estimation. Code: https://github.com/AIGeeksGroup/AnyDepth. Website: https://aigeeksgroup.github.io/AnyDepth.

Motivation and Objectives

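Reduce model and data complexity for zero-shot monocular depth estimation.
Propose a lightweight decoder (SDT) to replace multi-branch cross-scale fusion.
Introduce a quality-based data filtering strategy to improve training data efficiency.
Show that AnyDepth achieves competitive accuracy with far fewer parameters and FLOPs.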

Proposed Method

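Use a frozen DINOv3 encoder to extract multi-scale tokens from four transformer layers.
Introduce the Simple Depth Transformer (SDT): a single linear projection for token fusion enables single-path fusion and one-shot reconstruction.
Fuse the multi-layer tokens with learnable layer-wise weights, then map them to a spatial feature map.
Apply the Spatial Detail Enhancer (SDE) to refine texture details and local structure.
Upsample with a learnable dynamic sampler (DySample) along a progressive two-stage upsampling path.
Train with scale-and-shift-invariant (SSI) and gradient matching losses, and use data-centric filtering to remove low-quality samples.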
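
Two of the steps above, the layer-wise weighted fusion and the SSI loss, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the softmax normalization of the layer weights, the least-squares form of the scale/shift alignment, and all function names (fuse_layers, tokens_to_map, align_scale_shift, ssi_loss) are assumptions chosen for this example. The linear projection, SDE, DySample, and the gradient matching term are left out.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_tokens, layer_logits):
    """Blend tokens from several encoder layers with normalized weights.

    layer_tokens: L layers, each a list of N tokens, each a list of C floats.
    layer_logits: L raw scalars (they would be learnable during training).
    Softmax normalization is an assumption of this sketch.
    """
    weights = softmax(layer_logits)
    num_tokens = len(layer_tokens[0])
    channels = len(layer_tokens[0][0])
    fused = [[0.0] * channels for _ in range(num_tokens)]
    for w, tokens in zip(weights, layer_tokens):
        for n in range(num_tokens):
            for c in range(channels):
                fused[n][c] += w * tokens[n][c]
    return fused


def tokens_to_map(tokens, height, width):
    """Reshape height * width row-major tokens into a height x width grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def align_scale_shift(pred, target):
    """Least-squares s, t minimizing sum((s * pred + t - target) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        return 0.0, mean_t  # constant prediction: only the shift can be fitted
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    return scale, mean_t - scale * mean_p


def ssi_loss(pred, target):
    """Mean absolute error after aligning pred to target in scale and shift."""
    scale, shift = align_scale_shift(pred, target)
    errors = [abs(scale * p + shift - t) for p, t in zip(pred, target)]
    return sum(errors) / len(errors)
```

With equal logits, fuse_layers reduces to a plain average of the layers. A prediction that differs from the target only by an affine map, such as target = 2 * pred + 1, gives an ssi_loss of zero up to floating-point error, which is the invariance the loss is named for.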

Experimental Results

Research Questions

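RQ1: Can the lightweight SDT decoder achieve performance competitive with DPT on zero-shot monocular depth estimation?
RQ2: Does data-centric filtering improve training quality and model performance while using less data?
RQ3: What efficiency gains (parameters, FLOPs, latency) come from adopting SDT with a DINOv3 backbone on high-resolution inputs?
RQ4: How does AnyDepth perform on indoor and outdoor zero-shot depth benchmarks without large-scale supervised training data?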

Key Results

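SDT reduces the parameter count by approximately 85%-89% relative to DPT while achieving higher accuracy on zero-shot depth estimation.
The quality-based data filtering strategy shrinks the training set and improves overall model performance.
With SDT, AnyDepth achieves accuracy competitive with DPT in the zero-shot setting on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, with fewer FLOPs and comparable or faster inference.
Progressive upsampling with DySample preserves high-frequency detail better than bilinear upsampling, sharpening edges and depth boundaries.
The efficiency analysis shows substantial reductions in parameters and FLOPs across model sizes and input resolutions, with inference speed roughly unchanged or slightly improved.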
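
The quality-based filtering result above can be illustrated with a small threshold filter. The review does not state the paper's actual quality criterion, so the score used here (the fraction of pixels with a finite, positive depth value), the 0.8 threshold, and the names valid_fraction and filter_samples are all hypothetical stand-ins for illustration only.

```python
import math


def valid_fraction(depth_map):
    """Share of pixels holding a finite, strictly positive depth value.

    Hypothetical quality score; the paper's criterion may differ.
    """
    pixels = [d for row in depth_map for d in row]
    if not pixels:
        return 0.0
    valid = sum(1 for d in pixels if math.isfinite(d) and d > 0.0)
    return valid / len(pixels)


def filter_samples(samples, min_quality=0.8):
    """Keep (name, depth_map) pairs whose score meets the threshold."""
    return [
        (name, depth)
        for name, depth in samples
        if valid_fraction(depth) >= min_quality
    ]
```

The design point is that filtering happens once, before training, so the cost of scoring is paid a single time while every later epoch runs on the smaller, cleaner set.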

This review was generated by AI and reviewed by a human editor.