QUICK REVIEW

[论文解读] The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Saurabh Saxena, Charles Herrmann|arXiv (Cornell University)|Jun 2, 2023

Advanced Vision and Imaging被引用 20

一句话总结

论文表明通用去噪扩散模型在光流和单目深度估计任务上无需特定任务架构即可达到最先进或具有竞争力的结果，同时能够进行不确定性估计和多模态预测。

ABSTRACT

Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, and technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26\% on the KITTI optical flow benchmark, about 25\% better than the best published method. For an overview see https://diffusion-vision.github.io.

研究动机与目标

证明通用扩散模型在光流和单目深度估计上无需专门的架构或损失即可解决问题。
通过填充、逐步展开去噪以及L1去噪损失，解决来自嘈杂/不完整地面实况数据的挑战。
展示合成-real数据混合如何提升深度和流任务的预训练与泛化。
说明模型捕捉不确定性和多模态能力，并实现自粗到自细的细化与填充。

提出的方法

将深度和流估计表述为带有条件去噪扩散模型的图像到图像翻译。
在训练过程中对缺失地面实况值进行填充，以缓解数据噪声。
在微调阶段应用逐步展开的去噪，以减少训练-推断分布偏移。
引入L1去噪器损失以提高对嘈杂地面实况的鲁棒性。
在推断阶段采用自粗到自细的细化以实现高分辨率输出。
在合成数据与真实数据混合上进行预训练，以提高泛化能力。

Figure 2 : Training architecture . Given ground truth flow/depth, we first infill missing values using interpolation. Then, we add noise to the label map and train a neural network to model the conditional distribution of the noise given the RGB image(s), noisy label, and time step. One can optional

实验结果

研究问题

RQ1在没有任务特定架构的情况下，通用扩散模型是否能在光流和单目深度估计上达到最先进的结果？
RQ2如何使带有嘈杂/不完整地面实况的训练对于基于扩散的密集预测任务稳定？
RQ3多任务自监督与合成-真实预训练是否提升深度与光流的性能和泛化？
RQ4模型在这些任务中捕捉不确定性与多模态的能力如何？
RQ5扩散模型是否能在密集预测中支持自粗到自细的细化与缺失值插补？

主要发现

DDVM 在 NYU 室内深度估计上实现相对深度误差为0.074 的最先进水平。
在 KITTI 光流上，模型的 Fl-all 异常率为3.26%，比已发表的最佳方法高约25%的改进。
在 Sintel 和 KITTI 的零-shot 光流结果，在使用合成数据混合（AutoFlow、FlyingThings、Kubric、TartanAir）进行预训练后，超越强基线。
扩散模型通过多样本捕捉多模态性和不确定性，特别是在反射、半透明或含糊区域。
自粗到自细的细化和填充策略显著提升流和深度的准确性，降低 AEPE 并改善 KITTI/Sintel 指标。
在 KITTI 光流的零-shot与微调设置下，该模型超越 FlowFormer，并在 Sintel 的多数情况也超越，亦有一些例外情况另文讨论。

Figure 4 : Visual results comparing RAFT with our method after pretraining . Note that our method does much better on fine details and ambiguous regions.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。