QUICK REVIEW

[Paper Review] ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Shariq Farooq Bhat, Reiner Birkl | arXiv (Cornell University) | 2023. 02. 23.

Advanced Vision and Imaging | Citations: 168

One-Line Summary

ZoeDepth combines relative-depth pre-training with metric-depth heads and automatic routing, achieving strong metric-depth performance and unprecedented zero-shot generalization across indoor and outdoor datasets.

ABSTRACT

This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth .

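Motivation and Objectives

Address the tendency of metric-depth models to overfit to a single dataset and generalize poorly across domains.
Use relative-depth pre-training to improve generalization while retaining metric scale through fine-tuning.
Develop lightweight, domain-specific metric-depth heads (the metric bins module) and a routing mechanism that automatically selects the appropriate head at inference time.
Demonstrate improved state-of-the-art results on NYU Depth v2 and KITTI, along with strong zero-shot generalization to eight unseen datasets.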

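Proposed Method

Two-stage framework: first pre-train a shared encoder-decoder for relative depth estimation (RDE) following the MiDaS training strategy, then attach metric-depth heads and fine-tune on metric-depth datasets.
Introduce the metric bins module (MBM), a head with attractor layers that predicts per-pixel depth bin centers and combines them with per-pixel bin probabilities to output metric depth.
Replace the softmax over depth bins with a binomial-based ordinal probability model (log-binomial), which respects the ordering of the depth bins and improves stability.
Refine the bin centers through inverse attractor layers driven by multi-scale decoder features; bins are shifted toward learned attractors rather than split.
Automatic routing: at inference, a latent classifier trained on encoder features routes each image to the appropriate metric head; both a single-head configuration and a multi-head indoor/outdoor configuration are supported.
Supervise with a pixel-wise scale-invariant loss only; the Chamfer loss is omitted because of memory constraints.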
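
The bin-related steps above (binomial bin probabilities, attractor-based refinement of bin centers, and the final probability-weighted depth) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the attractor update uses the inverse form "shift = sum of (a - c) / (1 + alpha * |a - c|^gamma)", and the values of alpha, gamma, and the temperature are illustrative. In the real model the centers, attractors, and the binomial parameter q would come from learned layers; here they are random placeholders.

```python
import math
import numpy as np


def binomial_bin_probs(q, n_bins, temperature=1.0):
    """Per-pixel probabilities over ordered bins from a binomial distribution.

    q is the per-pixel binomial parameter in (0, 1); the result has one extra
    trailing axis of length n_bins and sums to 1 along that axis.
    """
    k = np.arange(n_bins)
    log_comb = np.array([math.log(math.comb(n_bins - 1, int(i))) for i in k])
    q = np.clip(np.asarray(q, dtype=float), 1e-6, 1.0 - 1e-6)[..., None]
    log_p = log_comb + k * np.log(q) + (n_bins - 1 - k) * np.log1p(-q)
    log_p = log_p / temperature  # temperature is an assumed knob
    log_p = log_p - log_p.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)


def inverse_attract(centers, attractors, alpha=300.0, gamma=2):
    """Shift each bin center toward the attractors (assumed inverse form).

    centers: (..., n_bins), attractors: (..., n_attr).
    Nearby attractors pull strongly; distant ones have little effect.
    """
    diff = attractors[..., None, :] - centers[..., :, None]
    delta = (diff / (1.0 + alpha * np.abs(diff) ** gamma)).sum(axis=-1)
    return centers + delta


def metric_depth(probs, centers):
    """Final depth as the probability-weighted sum of bin centers."""
    return (probs * centers).sum(axis=-1)


# Toy usage with random placeholders standing in for network outputs.
rng = np.random.default_rng(0)
H, W, n_bins, n_attr = 2, 3, 16, 4
centers = np.sort(rng.uniform(0.5, 10.0, size=(H, W, n_bins)), axis=-1)
attractors = rng.uniform(0.5, 10.0, size=(H, W, n_attr))
q = rng.uniform(0.05, 0.95, size=(H, W))

refined = inverse_attract(centers, attractors)
probs = binomial_bin_probs(q, n_bins)
depth = metric_depth(probs, refined)
print(depth.shape)
```

Because the binomial distribution is unimodal over the ordered bins, neighboring bins receive similar probability mass, which is the ordering property the list above refers to.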

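Experimental Results

Research Questions

RQ1: Can a single model pre-trained on relative depth generalize to metric depth across multiple domains (indoor and outdoor) without sacrificing metric accuracy?
RQ2: Do lightweight, domain-specific metric heads (MBM with attractors) effectively recover metric scale while preserving cross-domain generalization?
RQ3: How does automatic routing to domain-specific heads affect zero-shot generalization to unseen datasets?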

Key Results

Method	δ1	δ2	δ3	REL	RMSE	log10
NeWCRFs [50]	0.922	0.992	0.998	0.095	0.334	0.041
ZoeD-X-N	0.946	0.994	0.999	0.082	0.294	0.035
ZoeD-M12-N	0.955	0.995	0.999	0.075	0.270	0.032
ZoeD-M12-NK	0.953	0.995	0.999	0.077	0.277	0.033
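
For readers unfamiliar with the column headers, the sketch below computes the six metrics using their standard definitions in monocular depth evaluation (higher is better for the δ thresholds, lower is better for REL, RMSE, and log10). The function name and the "ground truth greater than zero" validity mask are assumptions for illustration; dataset-specific crops and depth caps used in actual benchmarks are not reproduced here.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Standard depth metrics over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0  # assumed validity mask
    pred, gt = pred[valid], gt[valid]

    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
        "rel": float(np.mean(np.abs(pred - gt) / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
    }


# Toy usage: a prediction that is uniformly 10% too deep.
gt = np.array([[1.0, 2.0], [4.0, 0.0]])  # 0.0 marks a missing measurement
pred = gt * 1.1 + (gt == 0)              # keep predictions positive everywhere
print(depth_metrics(pred, gt))
```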

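ZoeD-X-N already surpasses the prior state of the art on NYU Depth v2 without any relative-depth pre-training (REL improves by 13.7% over NeWCRFs, from 0.095 to 0.082).
ZoeD-M12-N (relative pre-training on 12 datasets plus metric fine-tuning on NYU) achieves a REL improvement of about 21% over the prior SOTA on NYU Depth v2 (from 0.095 to 0.075).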
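ZoeD-M12-NK (multi-domain fine-tuning on NYU and KITTI with routing to indoor/outdoor heads) is reported to achieve an overall REL improvement of 24.3% over NeWCRFs; on NYU Depth v2 alone, the table above gives a REL of 0.077 against 0.095, a reduction of about 18.9%. It also achieves strong zero-shot results on eight unseen datasets.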
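On unseen indoor datasets, zero-shot mRIθ reaches up to 46.3% (on DIODE Indoor, for example), and the model consistently outperforms prior methods.
On unseen outdoor datasets, zero-shot mRIθ reaches up to 976.4% on DIML Outdoor, with strong results on other datasets such as Virtual KITTI 2 and DDAD as well.
The approach trains robustly across domains, with no large drop in performance when indoor and outdoor datasets are trained jointly.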
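
The percentage improvements quoted in this section can be checked against the NYU Depth v2 table with a few lines of Python. The sketch below is only an arithmetic aid: the helper names are hypothetical, and the mRIθ function assumes that mRIθ is the mean of the relative improvements in δ1, REL, and RMSE over the NeWCRFs baseline, which should be verified against the paper before relying on it.

```python
# Values copied from the NYU Depth v2 table in this review.
TABLE = {
    "NeWCRFs":     {"delta1": 0.922, "rel": 0.095, "rmse": 0.334},
    "ZoeD-X-N":    {"delta1": 0.946, "rel": 0.082, "rmse": 0.294},
    "ZoeD-M12-N":  {"delta1": 0.955, "rel": 0.075, "rmse": 0.270},
    "ZoeD-M12-NK": {"delta1": 0.953, "rel": 0.077, "rmse": 0.277},
}


def relative_improvement(baseline, value, lower_is_better=True):
    """Relative improvement over the baseline, in percent."""
    if lower_is_better:
        return 100.0 * (baseline - value) / baseline
    return 100.0 * (value - baseline) / baseline


def mean_relative_improvement(method, baseline="NeWCRFs"):
    """Assumed mRI definition: average improvement over delta1, REL, RMSE."""
    b, m = TABLE[baseline], TABLE[method]
    parts = [
        relative_improvement(b["delta1"], m["delta1"], lower_is_better=False),
        relative_improvement(b["rel"], m["rel"]),
        relative_improvement(b["rmse"], m["rmse"]),
    ]
    return sum(parts) / len(parts)


for name in ("ZoeD-X-N", "ZoeD-M12-N", "ZoeD-M12-NK"):
    rel_gain = relative_improvement(TABLE["NeWCRFs"]["rel"], TABLE[name]["rel"])
    print(f"{name}: REL improvement {rel_gain:.1f}%, "
          f"assumed mRI {mean_relative_improvement(name):.1f}%")
```

Running it reproduces the 13.7% and roughly 21% REL figures for ZoeD-X-N and ZoeD-M12-N, and gives about 18.9% for ZoeD-M12-NK on this table.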

This review was generated by AI and reviewed by a human editor.