QUICK REVIEW

[Paper Review] Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke, Anton Obukhov|arXiv (Cornell University)|Dec 4, 2023

Advanced Vision and Imaging|Cited by 13

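One-line summary

Marigold fine-tunes the latent diffusion model of a pretrained Stable Diffusion to perform affine-invariant monocular depth estimation. Trained only on synthetic data, it generalizes zero-shot and achieves state-of-the-art results on several real datasets.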

ABSTRACT

Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.

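Motivation and objectives

Leverage the rich priors captured by diffusion models to improve the generalization of monocular depth estimation.
Develop a resource-efficient fine-tuning procedure that repurposes a pretrained image generator for depth estimation.
Achieve affine-invariant depth estimation that generalizes to unseen real-world datasets without using any real depth data during training.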

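Proposed method

Builds on the Stable Diffusion latent diffusion model (LDM) and fine-tunes only the denoising U-Net.
Encodes both the input RGB image and the depth map into the VAE latent space and adapts the denoiser to be conditioned on the concatenated latent codes.
Trains on synthetic RGB-D data with affine-invariant depth normalization and the standard diffusion objective in latent space.
Applies an inference scheme that combines DDIM-style sampling with test-time ensembling.
Uses annealed multi-resolution noise during training to improve convergence and generalization.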
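
The affine-invariant depth normalization mentioned above can be illustrated with a minimal sketch. The review does not spell out the formula, so the details here are assumptions: the depth map is mapped to roughly [-1, 1] using its 2nd and 98th percentiles, and the function names (`percentile`, `normalize_depth`) and the flat-list representation of a depth map are hypothetical. The point it shows is that any positive scale and any shift applied to the input depth cancel out.

```python
def percentile(values, q):
    """Return the q-th percentile (0..100) using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def normalize_depth(depth, low_q=2.0, high_q=98.0):
    """Map a flat list of depth values to roughly [-1, 1].

    Assumption: the low/high percentiles (2 and 98 by default) define the
    range. Values outside that range land slightly outside [-1, 1]; this
    sketch does not clip them.
    """
    d_lo = percentile(depth, low_q)
    d_hi = percentile(depth, high_q)
    if d_hi <= d_lo:
        raise ValueError("depth map has no usable range")
    return [((d - d_lo) / (d_hi - d_lo) - 0.5) * 2.0 for d in depth]


if __name__ == "__main__":
    depth = [1.0, 2.0, 4.0, 8.0, 16.0]
    rescaled = [3.0 * d + 5.0 for d in depth]  # arbitrary scale and shift
    print(normalize_depth(depth))
    print(normalize_depth(rescaled))  # matches the line above up to rounding
```

Because percentiles transform the same way as the data under a positive scale and a shift, the normalized target is identical for every affine variant of a depth map, which is what lets training data with different depth ranges share one representation.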

Figure 2: Overview of the Marigold fine-tuning protocol. Starting from a pretrained Stable Diffusion, we encode the image $\mathbf{x}$ and depth $\mathbf{d}$ into the latent space using the original Stable Diffusion VAE. We fine-tune just the U-Net by optimizing the standard diffusion objective relative t

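Experimental results

Research questions

RQ1: Can the rich visual priors of a pretrained diffusion model be repurposed for broadly generalizable depth estimation from a single image?
RQ2: Can a diffusion model be fine-tuned efficiently on synthetic data to produce affine-invariant depth maps?
RQ3: How do conditioning, normalization, and inference strategy affect zero-shot generalization to unseen real datasets?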

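Key findings

Marigold achieves state-of-the-art affine-invariant depth estimation on several real datasets without seeing a single real depth map during training.
Training on the synthetic datasets Hypersim and Virtual KITTI with the proposed method yields strong zero-shot transfer to indoor and outdoor scenes.
Annealed multi-resolution noise and test-time ensembling improve depth accuracy and robustness.
A single prediction already performs well; ensembling 10–20 predictions further lowers AbsRel and raises δ1.
The method needs only modest compute: training converges within a few days on a consumer GPU, about 2.5 days on an RTX 4090.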
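
To make the metrics and the ensembling in these findings concrete, here is a minimal sketch of how an affine-invariant prediction can be scored and how several predictions can be merged. It is written under stated assumptions, not taken from the authors' code: depth maps are flat lists of positive ground-truth values, the prediction is aligned to ground truth with a closed-form least-squares scale and shift before computing AbsRel and δ1 (threshold 1.25), and `median_ensemble` is a simplification that aligns every prediction to the first one and takes the per-pixel median. All function names are hypothetical.

```python
from statistics import median


def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s * pred + t fits target."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undefined")
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return [scale * p + shift for p in pred]


def abs_rel(pred, gt):
    """Mean absolute relative error; gt values must be positive."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    hits = 0
    for p, g in zip(pred, gt):
        # Non-positive aligned predictions are counted as misses.
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    return hits / len(gt)


def median_ensemble(predictions):
    """Simplified ensembling: align each prediction to the first one,
    then take the per-pixel median. Not the joint alignment a full
    implementation might use."""
    reference = predictions[0]
    aligned = [reference] + [align_scale_shift(p, reference)
                             for p in predictions[1:]]
    return [median(pixel) for pixel in zip(*aligned)]


if __name__ == "__main__":
    gt = [1.0, 2.0, 3.0, 4.0]
    pred = [0.0, 0.5, 1.0, 1.5]  # an affine variant of gt
    aligned = align_scale_shift(pred, gt)
    print(abs_rel(aligned, gt), delta1(aligned, gt))
```

Aligning before scoring is what makes the comparison fair for affine-invariant estimators: the prediction is only judged on the structure of the depth map, not on its absolute scale or offset.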

Figure 3: Overview of the Marigold inference scheme. Given an input image $\mathbf{x}$, we encode it with the original Stable Diffusion VAE into the latent code $\mathbf{z}^{(\mathbf{x})}$, and concatenate with the depth latent $\mathbf{z}^{(\mathbf{d})}_{t}$ before giving it to the modified fine-tuned

This review was generated by AI and checked by a human editor.