QUICK REVIEW

[論文レビュー] The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Saurabh Saxena, Charles Herrmann|arXiv (Cornell University)|Jun 2, 2023

Advanced Vision and Imaging被引用数 20

ひとこと要約

本論文は、一般的なノイズ除去拡散モデルが、タスク特化型のアーキテクチャを必要とせず、光学フローと単眼深度推定で最先端または競合的な結果を達成できることを示しつつ、不確実性推定とマルチモーダル予測を可能にする。

ABSTRACT

Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, and technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26\% on the KITTI optical flow benchmark, about 25\% better than the best published method. For an overview see https://diffusion-vision.github.io.

研究の動機と目的

一般的な拡散モデルが、特殊なアーキテクチャや損失関数を用いずに、光学フローと単眼深度推定を解決できることを示す。
ノイズのある/不完全なグラウンドトゥルースによるデータ課題に対し、欠測埋め、ステップ展開デノイジング、およびL1デノイザ損失を用いて対処する。
深度とフロータスクの事前学習と一般化を改善するために、合成データと実データの混合がどのように効果的かを示す。
不確実性とマルチモーダリティを捉えるモデルの能力を示し、粗さ/細かな階層での Refinement と欠損値の補完を可能にする。

提案手法

深度とフロー推定を、条件付きデノイジング拡散モデルを用いた画像-to-画像変換として定式化する。
トレーニング時に欠損グラウンドトゥルース値の欠測埋めを用いてデータノイズを緩和する。
微調整時にステップ展開デノイジングを適用して、訓練時と推論時の分布シフトを低減する。
ノイズのあるグラウンドトゥルースに対する頑健性のためL1デノイザ損失を組み込む。
推論時に粗から細へのリファインメントを適用して高解像度の出力を実現する。
合成データセットと実データの混合に対して事前学習を行い、一般化を向上させる。

Figure 2 : Training architecture . Given ground truth flow/depth, we first infill missing values using interpolation. Then, we add noise to the label map and train a neural network to model the conditional distribution of the noise given the RGB image(s), noisy label, and time step. One can optional

実験結果

リサーチクエスチョン

RQ1一般的な拡散モデルは、タスク特化型アーキテクチャなしで光学フローと単眼深度推定において最先端の結果を達成できるか？
RQ2ノイズのある/不完全なグラウンドトゥルースでの訓練を、拡散ベースの密な予測タスクに対していかに安定化できるか？
RQ3深度とフローの性能と一般化を、マルチタスク自己監視と合成-実データ前訓練が高めるか？
RQ4これらのタスクにおける不確実性とマルチモダリティを捉えるモデルの能力はどれくらいか？
RQ5拡散モデルは密な予測において粗から細へのリファインメントと欠損値補完をサポートできるか？

主な発見

DDVMはNYU室内深度推定で相対深度誤差0.074の最先端を達成。
KITTI光学フローでは、Fl-allアウトライヤー率3.26%を達成し、公開済み手法のベストより約25%良い。
SintelとKITTIでのゼロショット光学フローは、合成データ混合（AutoFlow, FlyingThings, Kubric, TartanAir）で事前学習した後、強力なベースラインを上回る。
拡散モデルは、反射的・半透明・曖昧な領域で特に、複数サンプルを介してマルチモダリティと不確実性を捉える。
粗から細へのリファインメントと欠測埋め戦略は、フローと深度の精度を大幅に改善し、AEPEを低減し、KITTI/Sintel指標を改善する。
本モデルはKITTI光学フローでゼロショットおよび微調整設定でFlowFormerを上回り、Sintelの多くのケースでも上回るが、いくつかの例外を議論。

Figure 4 : Visual results comparing RAFT with our method after pretraining . Note that our method does much better on fine details and ambiguous regions.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。