QUICK REVIEW

[论文解读] Monocular Depth Estimation using Diffusion Models

Saurabh Saxena, Abhishek Kar|arXiv (Cornell University)|Feb 28, 2023

Advanced Vision and Imaging被引用 24

一句话总结

本文提出 DepthGen，一种基于扩散模型的单目深度估计方法，利用自监督预训练与有监督微调，在 NYU 上达到最先进结果，在 KITTI 上具竞争力，并实现多模态深度推断和文本到3D 任务的深度插补。

ABSTRACT

We formulate monocular depth estimation using denoising diffusion models, inspired by their recent successes in high fidelity image generation. To that end, we introduce innovations to address problems arising due to noisy, incomplete depth maps in training data, including step-unrolled denoising diffusion, an $L_1$ loss, and depth infilling during training. To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks. Despite the simplicity of the approach, with a generic loss and architecture, our DepthGen model achieves SOTA performance on the indoor NYU dataset, and near SOTA results on the outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot performance combined with depth imputation, enable a simple but effective text-to-3D pipeline. Project page: https://depth-gen.github.io

研究动机与目标

将单目深度估计视为扩散建模问题，并借鉴近期扩散模型的成功。
通过填充、L1 损失和逐步展开的去噪扩散来缓解训练数据挑战（深度噪声/不完整）。
整合自监督预训练，以应对标记数据有限的问题，并实现零样本深度完成。
在室内 NYU 上展示最先进性能，在室外 KITTI 上达到有竞争力的结果。
展示多模态深度推断以及文本到3D 和新视图合成的潜力。

提出的方法

使用条件扩散模型来学习 p(y|x)，其中 x 是 RGB，y 是深度。
在 Palette 风格扩散模型上进行自监督预训练，任务为图像到图像翻译（上色、修复、去裁剪、JPEG 伪影去除）。
使用带监督的 RGB-D 数据通过 L1 损失进行微调以增强对噪声深度的鲁棒性。
通过深度填充（最近邻；户外数据的天空处理）和在微调阶段进行逐步展开去噪扩散（SUD）来解决缺失深度。
在训练期间，填充深度图中的空洞，并仅在已知像元上计算损失；推理期间，选择性地展开一步以使训练/推理分布对齐（SUD）。
评估遵循 NYU 的标准指标（REL、RMS、δ1/δ2/δ3、log10）和 KITTI 的标准指标（REL、Sq-rel、RMS、RMS log）。

Figure 1: Training Architecture. Given a groundtruth depth map, we first infill missing depth using nearest neighbor interpolation. Then, following standard diffusion training, we add noise to the depth map and train a neural network to predict the noise given the RGB image and noisy depth map. Duri

实验结果

研究问题

RQ1扩散模型是否能有效地适应从 RGB 图像进行单目深度估计？
RQ2如何通过填充、L1 损失、SUD 来提高对噪声和不完整深度数据的训练鲁棒性，以缩小训练–推理分布差距？
RQ3自监督预训练在标记数据稀缺时是否能提升深度估计，并且它如何与有监督微调结合？
RQ4基于扩散的深度模型是否支持多模态深度表示以及零样本深度完成，以用于如文本到3D 和新视图合成的下游任务？

主要发现

方法	Delta1	Delta2	Delta3	REL	RMS	log10
DepthGen (NYU) samples=1	0.944	0.986	0.995	0.075	0.324	0.032
DepthGen (NYU) samples=2	0.944	0.987	0.996	0.074	0.319	0.032
DepthGen (NYU) samples=4	0.946	0.987	0.996	0.074	0.315	0.032
DepthGen (NYU) samples=8	0.946	0.987	0.996	0.074	0.314	0.032
DepthGen (KITTI) samples=1	—	—	—	—	—	—
DepthGen (KITTI) samples=2	—	—	—	—	—	—
DepthGen (KITTI) samples=4	—	—	—	—	—	—
DepthGen (KITTI) samples=8	—	—	—	—	—	—

DepthGen 在 NYU Depth v2 上达到最先进的相对误差（REL）0.074。
DepthGen 在 KITTI 上具竞争力，在报告的指标中超过了若干基线。
消融实验显示自监督预训练和有监督深度预训练都显著提高了性能（有监督预训练的提升更大）。
深度填充对于 outdoor KITTI 至关重要，且有助于缓解孔洞；SUD 在有孔洞时进一步提升结果。
L1 损失在鲁棒性对抗噪声深度数据方面优于 L2。
DepthGen 支持多模态深度预测，能够捕捉深度不确定性（例如通过透明/反射区域）。
该模型实现了零样本深度完成，并且可以通过扩散式插补与文本到3D 流程集成。

Figure 5: Pipeline for iteratively generating a 3D scene conditioned on text $c=A\ bedroom.$ See text for details.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。