QUICK REVIEW

[Paper Review] DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data

Wei Yin, Xinlong Wang|arXiv (Cornell University)|2020. 02. 03.

Advanced Vision and Imaging | References: 30 | Citations: 34

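One-Line Summary

DiverseDepth learns affine-invariant depth from a large-scale, diverse dataset using a multi-curriculum learning strategy, achieving strong zero-shot generalization in monocular depth prediction while preserving the geometric structure of scenes.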

ABSTRACT

We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation, thus preserving accurate shapes of a scene. Previous methods that predict metric depth often work well only for a specific scene. In contrast, learning relative depth (information of being closer or further) can enjoy better generalization, with the price of failing to recover the accurate geometric shape of the scene. In this work, we propose a dataset and methods to tackle this dilemma, aiming to predict accurate depth up to an affine transformation with good generalization to diverse scenes. First we construct a large-scale and diverse dataset, termed Diverse Scene Depth dataset (DiverseDepth), which has a broad range of scenes and foreground contents. Compared with previous learning objectives, i.e., learning metric depth or relative depth, we propose to learn the affine-invariant depth using our diverse dataset to ensure both generalization and high-quality geometric shapes of scenes. Furthermore, in order to train the model on the complex dataset effectively, we propose a multi-curriculum learning method. Experiments show that our method outperforms previous methods on 8 datasets by a large margin with the zero-shot test setting, demonstrating the excellent generalization capacity of the learned model to diverse scenes. The reconstructed point clouds with the predicted depth show that our method can recover high-quality 3D shapes. Code and dataset are available at: https://tinyurl.com/DiverseDepth

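Motivation and Goals

Achieve depth estimation that generalizes across diverse scenes while preserving accurate 3D geometry.
Build a large-scale, diverse RGB-D dataset (DiverseDepth) covering rigid and non-rigid content as well as indoor and outdoor scenes.
Predict affine-invariant depth, i.e. depth up to an unknown scale and shift, so that the model does not have to recover each scene's scale and shift and can generalize better.
Develop a multi-curriculum learning (MCL) scheme to train effectively on diverse, complex data.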

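Proposed Method

Introduces the DiverseDepth dataset, composed of three parts: Part-fore (foreground), Part-in (indoor background), and Part-out (outdoor background).
Formulates affine-invariant depth prediction by factoring out the scale and shift between the real camera system and a virtual camera system.
Uses a combined loss: a loss with high-order geometric constraints (virtual normals, surface normals) together with a scale-shift-invariant loss (SSIL).
Adopts a multi-curriculum learning (MCL) strategy that sorts the data within each of the three parts by difficulty and trains from easy batches to hard ones.
Rescales the predicted affine-invariant depth back to metric depth, then evaluates in a zero-shot setting on eight datasets using the Abs-Rel and WHDR metrics.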
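
The rescaling and Abs-Rel evaluation step can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the review does not say how scale and shift are recovered, so the least-squares fit below is an assumption, as are the function names `align_scale_shift` and `abs_rel` and the convention that non-positive ground-truth values mark invalid pixels.

```python
import numpy as np


def _valid_mask(gt, mask=None):
    # Assumption: ground-truth values <= 0 mark pixels without a depth label.
    if mask is None:
        return gt > 0
    return np.asarray(mask, dtype=bool).ravel()


def align_scale_shift(pred, gt, mask=None):
    """Least-squares fit of (s, t) so that s * pred + t approximates gt."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    mask = _valid_mask(gt, mask)
    # Solve [pred, 1] @ [s, t]^T = gt over the valid pixels.
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt[mask], rcond=None)
    return float(s), float(t)


def abs_rel(pred, gt, mask=None):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    mask = _valid_mask(gt, mask)
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))


# Toy example: the prediction equals the ground truth up to an affine map.
gt = np.array([[1.0, 2.0], [4.0, 8.0]])
pred = (gt - 0.5) / 2.0
s, t = align_scale_shift(pred, gt)
print("scale:", s, "shift:", t)
print("Abs-Rel after alignment:", abs_rel(s * pred + t, gt))
```

Because the fit uses the ground truth of each test image, an affine-invariant prediction is scored only on the shape it recovers, not on its absolute scale or offset.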

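Experimental Results

Research Questions

RQ1: When affine-invariant depth is learned from a diverse dataset, does it generalize to unseen scenes better than metric or relative depth methods?
RQ2: Do a large-scale, diverse training corpus and a structured curriculum improve cross-domain depth prediction quality?
RQ3: How does combining the VNL/SSIL losses with affine invariance affect 3D shape reconstruction?
RQ4: Does the proposed method perform differently on foreground objects (e.g., people) than on background scenes?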

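Key Results

Outperforms previous metric-depth and relative-depth methods on 8 zero-shot datasets, with relative improvements of up to 70% reported.
On NYU, achieves performance comparable to methods trained specifically on NYU (e.g., Abs-Rel of 11.7% vs. 12.3% for a competing method).
Produces higher-quality 3D reconstructions and preserves scene geometry better than relative-depth baselines.
Ablations show that multi-curriculum learning substantially improves generalization over uniform sampling and a reverse-curriculum variant.
The loss analysis suggests that VNL and SSIL outperform other losses for affine-invariant depth on the diverse dataset.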
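
The three sampling schemes compared in the ablation (easy-to-hard curriculum, uniform sampling, reverse curriculum) can be illustrated with a toy scheduler. This is a simplified sketch and not the paper's MCL procedure: the per-sample difficulty scores, the fixed number of stages, and the names `order_part` and `multi_curriculum` are all hypothetical. It only shows the idea of running one curriculum per data part in parallel.

```python
import random


def order_part(samples, mode, rng):
    """Order one part's (sample_id, difficulty) pairs according to the mode."""
    if mode == "uniform":
        shuffled = list(samples)
        rng.shuffle(shuffled)  # ignore difficulty entirely
        return shuffled
    if mode not in ("easy_to_hard", "hard_to_easy"):
        raise ValueError(f"unknown mode: {mode}")
    return sorted(samples, key=lambda s: s[1], reverse=(mode == "hard_to_easy"))


def multi_curriculum(parts, num_stages, mode="easy_to_hard", seed=0):
    """Split every part into stages and merge the k-th chunk of each part.

    parts: dict mapping a part name to a list of (sample_id, difficulty).
    Returns a list of stages; each stage is a list of (part_name, sample_id).
    """
    rng = random.Random(seed)
    stages = [[] for _ in range(num_stages)]
    for name, samples in parts.items():
        ordered = order_part(samples, mode, rng)
        for k in range(num_stages):
            lo = k * len(ordered) // num_stages
            hi = (k + 1) * len(ordered) // num_stages
            stages[k].extend((name, sid) for sid, _ in ordered[lo:hi])
    for stage in stages:
        rng.shuffle(stage)  # mix the parts within a stage
    return stages


# Hypothetical difficulty scores for the three DiverseDepth parts.
parts = {
    "Part-fore": [("f0", 0.9), ("f1", 0.1)],
    "Part-in": [("i0", 0.2), ("i1", 0.8)],
    "Part-out": [("o0", 0.5), ("o1", 0.4)],
}
for k, stage in enumerate(multi_curriculum(parts, num_stages=2)):
    print("stage", k, sorted(stage))
```

Keeping a separate ordering per part means every stage still draws from foreground, indoor, and outdoor data, so no single part dominates the early, easy phase of training.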

This review was generated by AI and reviewed by a human editor.