QUICK REVIEW

[Paper Review] Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke, Anton Obukhov | arXiv (Cornell University) | 2023-12-04

Advanced Vision and Imaging | Citations: 13

One-Line Summary

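Marigold fine-tunes a pretrained Stable Diffusion latent diffusion model to perform affine-invariant monocular depth estimation. Trained only on synthetic data, it generalizes zero-shot and reaches state-of-the-art performance on several real-world datasets.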

ABSTRACT

Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.

Motivation and Objectives

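Improve the generalization of monocular depth estimation by exploiting the rich priors of diffusion models.
Develop a resource-efficient fine-tuning protocol that repurposes a pretrained image generator for depth estimation.
Achieve affine-invariant depth estimation that generalizes to unseen real-world datasets without using any real depth data during training.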

Proposed Method

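Uses a latent diffusion model (LDM) based on Stable Diffusion and fine-tunes only the denoising U-Net.
Encodes both the input RGB image and the depth map into the VAE latent space, and conditions the denoiser on the concatenated latent codes.
Trains on synthetic RGB-D data with the standard diffusion objective in latent space, after applying an affine-invariant depth normalization.
Applies an enhanced inference scheme that combines DDIM-style sampling with test-time ensembling over multiple stochastic passes.
Uses multi-resolution noise with annealing during training to improve convergence and generalization.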
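
The affine-invariant depth normalization mentioned above can be sketched as follows. This is a minimal illustration, assuming a percentile-based mapping of each depth map to roughly [-1, 1]; the 2% and 98% percentiles and the function name `normalize_depth` are assumptions for this sketch and are not stated in this review.

```python
import numpy as np


def normalize_depth(depth, low_pct=2.0, high_pct=98.0):
    """Map a depth map to roughly [-1, 1] using robust percentiles.

    The percentile choices (2% / 98%) are an assumption of this sketch.
    Values outside the percentile range fall slightly outside [-1, 1].
    """
    d_low, d_high = np.percentile(depth, [low_pct, high_pct])
    if d_high - d_low < 1e-8:
        raise ValueError("depth map is (nearly) constant; cannot normalize")
    unit = (depth - d_low) / (d_high - d_low)  # percentiles map to 0 and 1
    return (unit - 0.5) * 2.0                  # percentiles map to -1 and 1


# Usage: a depth map and an affinely distorted copy normalize to the same result.
rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 10.0, size=(64, 64))
norm_a = normalize_depth(depth)
norm_b = normalize_depth(3.0 * depth + 5.0)
print(np.allclose(norm_a, norm_b))
```

Because the percentiles move with any positive scale and any shift of the input, the normalized map does not depend on the dataset's absolute depth range, which is what lets synthetic indoor and outdoor data share one target representation.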

Figure 2: Overview of the Marigold fine-tuning protocol. Starting from a pretrained Stable Diffusion, we encode the image $\mathbf{x}$ and depth $\mathbf{d}$ into the latent space using the original Stable Diffusion VAE. We fine-tune just the U-Net by optimizing the standard diffusion objective relative t…
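
For reference, the "standard diffusion objective" named in the caption can be written in the usual DDPM noise-prediction form. The notation is assumed here rather than taken from this review: $\mathbf{z}^{(\mathbf{d})}$ is the depth latent, $\mathbf{z}^{(\mathbf{x})}$ the image latent, $\bar{\alpha}_t$ the cumulative noise schedule, and $\boldsymbol{\epsilon}_\theta$ the U-Net denoiser. Plain Gaussian noise is shown for simplicity.

```latex
% Forward (noising) process on the depth latent
\mathbf{z}^{(\mathbf{d})}_t
  = \sqrt{\bar{\alpha}_t}\,\mathbf{z}^{(\mathbf{d})}_0
  + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},
  \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

% Noise-prediction loss, conditioned on the image latent
\mathcal{L}
  = \mathbb{E}_{\mathbf{z}^{(\mathbf{d})}_0,\ \boldsymbol{\epsilon},\ t}
    \left\| \boldsymbol{\epsilon}
      - \boldsymbol{\epsilon}_\theta\!\left(\mathbf{z}^{(\mathbf{d})}_t,\ \mathbf{z}^{(\mathbf{x})},\ t\right)
    \right\|_2^2
```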

Experimental Results

Research Questions

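RQ1: Can the rich visual priors of a pretrained diffusion model be repurposed for depth estimation well enough to achieve zero-shot generalization to unseen real-world datasets?
RQ2: How can a diffusion model be fine-tuned efficiently on synthetic data to produce affine-invariant depth maps?
RQ3: How do the conditioning, normalization, and inference strategies affect zero-shot generalization to unseen real-world datasets?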

Key Results

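Marigold achieves state-of-the-art affine-invariant depth estimation on several real-world datasets without ever seeing a real depth map during training.
Training on synthetic datasets (Hypersim and Virtual KITTI) with the proposed protocol yields strong zero-shot transfer to both indoor and outdoor scenes.
Multi-resolution noise with annealing, together with test-time ensembling, improves depth accuracy and robustness.
A single prediction already performs well, and ensembling 10 to 20 predictions further lowers AbsRel and raises δ1 accuracy.
Training converges in a few GPU-days on a consumer-grade GPU, about 2.5 days on an RTX 4090.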
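
The AbsRel and δ1 figures above are the usual metrics for affine-invariant depth. A minimal sketch of how they are commonly computed is given below, assuming the prediction is first aligned to the ground truth with a least-squares scale and shift. The function names and the 1.25 threshold for δ1 follow common convention and are assumptions of this sketch, not details stated in the review.

```python
import numpy as np


def align_scale_shift(pred, gt, mask):
    """Fit scale s and shift t by least squares so that s * pred + t ≈ gt on valid pixels."""
    p = pred[mask].ravel()
    g = gt[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t


def abs_rel(pred, gt, mask):
    """Mean absolute relative error over valid pixels (lower is better)."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))


def delta1(pred, gt, mask, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold (higher is better).

    Assumes aligned predictions are positive on valid pixels.
    """
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < threshold))


# Usage: a prediction that differs from the ground truth only by an affine map
# should score (near) perfectly after alignment.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(48, 48))
mask = gt > 0                      # all pixels valid in this toy example
pred = (gt - 2.0) / 4.0            # affine-distorted "prediction"
aligned = align_scale_shift(pred, gt, mask)
print(abs_rel(aligned, gt, mask), delta1(aligned, gt, mask))
```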

Figure 3: Overview of the Marigold inference scheme. Given an input image $\mathbf{x}$, we encode it with the original Stable Diffusion VAE into the latent code $\mathbf{z}^{(\mathbf{x})}$, and concatenate with the depth latent $\mathbf{z}^{(\mathbf{d})}_{t}$ before giving it to the modified fine-tuned…
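
Test-time ensembling over several stochastic inference passes can be illustrated with a simplified sketch. Each pass yields a depth map that is only defined up to scale and shift, so the maps must be aligned before they are merged. The version below aligns every prediction to the first one by least squares and then takes the per-pixel median. This is a deliberately simplified stand-in and not necessarily the paper's exact alignment procedure; the function name `ensemble_depth` is hypothetical.

```python
import numpy as np


def ensemble_depth(predictions):
    """Merge affine-ambiguous depth maps.

    Simplified scheme (an assumption of this sketch): align each prediction
    to the first one with a least-squares scale and shift, then take the
    per-pixel median of the aligned maps.
    """
    reference = predictions[0].ravel()
    aligned = []
    for pred in predictions:
        A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, reference, rcond=None)
        aligned.append(s * pred + t)
    return np.median(np.stack(aligned, axis=0), axis=0)


# Usage: three noisy passes with different (unknown) scales and shifts.
rng = np.random.default_rng(0)
base = rng.uniform(0.0, 1.0, size=(32, 32))
passes = [
    base + 0.01 * rng.standard_normal(base.shape),
    2.0 * base + 1.0 + 0.01 * rng.standard_normal(base.shape),
    0.5 * base - 0.3 + 0.01 * rng.standard_normal(base.shape),
]
merged = ensemble_depth(passes)
print(merged.shape)
```

The median is used rather than the mean so that a single outlier pass has limited influence on the merged map.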

This review was generated by AI and checked by a human editor.