QUICK REVIEW

[論文レビュー] Monocular Depth Estimation using Diffusion Models

Saurabh Saxena, Abhishek Kar|arXiv (Cornell University)|Feb 28, 2023

Advanced Vision and Imaging被引用数 24

ひとこと要約

DepthGen は、自己教師付き pre-training と supervised fine-tuning を活用したモノキュラー深度推定の拡散モデルベースのアプローチを紹介し、NYU での最先端結果を達成し、KITTI でも競争力のある結果を示しつつ、テキスト-3D タスクのためのマルチモーダル深度推論と深度埋め込みを可能にする。

ABSTRACT

We formulate monocular depth estimation using denoising diffusion models, inspired by their recent successes in high fidelity image generation. To that end, we introduce innovations to address problems arising due to noisy, incomplete depth maps in training data, including step-unrolled denoising diffusion, an $L_1$ loss, and depth infilling during training. To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks. Despite the simplicity of the approach, with a generic loss and architecture, our DepthGen model achieves SOTA performance on the indoor NYU dataset, and near SOTA results on the outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot performance combined with depth imputation, enable a simple but effective text-to-3D pipeline. Project page: https://depth-gen.github.io

研究の動機と目的

Monocular depth estimation を diffusion modeling の問題として動機付け、最近の diffusion model の成功を活用する。
training data の課題（ノイズのある/欠損している深度）を infilling、L1 loss、そして step-unrolled denoising diffusion によって緩和する。
limited labeled data に対処するため self-supervised pre-training を導入し、zero-shot depth completion を可能にする。
Indoor NYU での最先端性能と outdoor KITTI での競争力のある結果を示す。
multimodal depth inference と text-to-3D や novel view synthesis への潜在的応用を紹介する。

提案手法

Use conditional diffusion models to learn p(y|x) where x is RGB and y is depth.
Pre-train a self-supervised Palette-style diffusion model on image-to-image translation tasks (colorization, inpainting, uncropping, JPEG artifact removal).
Fine-tune with supervised RGB-D data using L1 loss to improve robustness to noisy depth.
Address missing depth via depth infilling (nearest-neighbor; sky handling for outdoor data) and step-unrolled denoising diffusion (SUD) during fine-tuning.
During training, infill holes in depth maps and compute loss only on known pixels; during inference, optionally unroll one forward step to align training/inference distributions (SUD).
Evaluation follows standard metrics for NYU (REL, RMS, δ1/δ2/δ3, log10) and KITTI (REL, Sq-rel, RMS, RMS log).

Figure 1: Training Architecture. Given a groundtruth depth map, we first infill missing depth using nearest neighbor interpolation. Then, following standard diffusion training, we add noise to the depth map and train a neural network to predict the noise given the RGB image and noisy depth map. Duri

実験結果

リサーチクエスチョン

RQ1Can diffusion models be effectively adapted to monocular depth estimation from RGB images?
RQ2How can training with noisy, incomplete depth data be made robust (e.g., infilling, L1 loss, SUD) to close training-inference distribution gaps?
RQ3Does self-supervised pre-training improve depth estimation when labeled data is scarce, and how does it combine with supervised fine-tuning?
RQ4Can the diffusion-based depth model support multimodal depth representations and zero-shot depth completion for downstream tasks like text-to-3D and novel view synthesis?

主な発見

手法	デルタ1	デルタ2	デルタ3	REL	RMS	log10
DepthGen (NYU) samples=1	0.944	0.986	0.995	0.075	0.324	0.032
DepthGen (NYU) samples=2	0.944	0.987	0.996	0.074	0.319	0.032
DepthGen (NYU) samples=4	0.946	0.987	0.996	0.074	0.315	0.032
DepthGen (NYU) samples=8	0.946	0.987	0.996	0.074	0.314	0.032
DepthGen (KITTI) samples=1	—	—	—	—	—	—
DepthGen (KITTI) samples=2	—	—	—	—	—	—
DepthGen (KITTI) samples=4	—	—	—	—	—	—
DepthGen (KITTI) samples=8	—	—	—	—	—	—

DepthGen は NYU Depth v2 で相対誤差 (REL) が 0.074 の最先端を達成。
DepthGen は KITTI で競争力があり、報告された指標のいくつかでベースラインを上回る。
アブレーションにより、self-supervised pre-training と supervised depth pre-training の両方が性能を大幅に向上させることが示され、特に supervised pre-training の寄与が大きい。
Depth infilling は outdoor KITTI にとって重要で、穴を緩和するのに役立ち、SUD は特に穴がある場合に結果をさらに改善する。
L1 loss はノイズのある深度データに対する頑健性で L2 を上回る。
DepthGen はマルチモーダルな深度予測をサポートし、透明/反射領域などの深度不確定性を捉える。
このモデルは zero-shot depth completion を可能にし、diffusion-based imputation を介して text-to-3D パイプラインと統合できる。

Figure 5: Pipeline for iteratively generating a 3D scene conditioned on text $c=A\ bedroom.$ See text for details.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。