QUICK REVIEW

[論文レビュー] Emergent Correspondence from Image Diffusion

Luming Tang, Menglin Jia|arXiv (Cornell University)|Jun 6, 2023

Advanced Neuroimaging Techniques and Applications被引用数 54

ひとこと要約

研究は拡散モデルが暗黙のうちにピクセル対応を学習し、Diffusion Features (DIFT) を導入して実画像上のセマンティック、幾何学、時間的対応を監視なしで確立する。

ABSTRACT

Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy to extract this implicit knowledge out of diffusion networks as image features, namely DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on the task-specific data or annotations, DIFT is able to outperform both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. Particularly for semantic correspondence, DIFT from Stable Diffusion is able to outperform DINO and OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k benchmark. It even outperforms the state-of-the-art supervised methods on 9 out of 18 categories while remaining on par for the overall performance. Project page: https://diffusionfeatures.github.io

研究の動機と目的

Image diffusion models が明示的な監督なしで対応を学習することを実証する。
事前学習済みの拡散モデルから実画像の高密度な拡散ベースの特徴量（DIFT）を抽出する。
DIFT をセマンティック、幾何学、時間的対応のベンチマークで評価し、自己教師ありおよび教師あり法と比較する。
編集伝搬やクロスドメインマッチングといった実用的な応用を示す。

提案手法

事前学習済みの拡散モデル（Stable Diffusion 2-1 および Ablated Diffusion Model）を用いて、逆拡散過程の中間層の活性化を抽出する。
実画像にノイズを加えて拡散タイムステップ t に至らせ、拡散モデルに通すことで DIFT を得るように実画像の特徴を近似する。
タスクごとに 2D グリッド探索でタイムステップ t とネットワーク層を選択し、セマンティック情報と低レベル情報のバランスを取る。
複数のノイズ付き実現にわたって特徴を平均化し、安定性を向上させる。
ピクセル特徴間のコサイン距離で最近傍マッチングを行い、対応を確立する。
微調整や監督なしで、セマンティック、幾何学、時間的対応のベンチマークを横断して評価する。

Figure 1 : Given a red source point in an image (far left), we would like to develop a model that automatically finds the corresponding point in the images on the right. Without any fine-tuning or correspondence supervision, our proposed diffusion features (DIFT) could establish semantic corresponde

実験結果

リサーチクエスチョン

RQ1画像合成のために訓練された拡散モデルは、クロスイメージマッチングに有用なピクセルレベルの対応を暗黙のうちに獲得できるか。
RQ2対応タスクのために微調整せずに、拡散モデルから高密度な実画像特徴をどのように抽出できるか。
RQ3異なる対応タスク（セマンティック、幾何学、時間的）において、最適な拡散タイムステップ t と層は何か。
RQ4標準的な対応ベンチマークで DIFT は自己教師ありおよび教師あり法とどのように比較されるか。
RQ5DIFT は監督なしで編集伝搬や動画/物体追跡といったタスクを実現できるか。

主な発見

DIFT は実画像上で、タスク固有の監督なしで、堅牢なセマンティック、幾何学、時間的対応を実現する。
SPair-71k で、DIFT sd はセマンティック PCK において DINO および OpenCLIP をそれぞれ19点、14点上回り；18カテゴリ中9カテゴリのセマンティック対応で教師あり法に対して競争力がある。
DIFT は PF-WILLOW および SPair-71k のベンチマークで弱教師ありベースラインおよび自己监督特徴より優れており、セマンティックな結果は時に最先端の教師あり手法を上回る。
HPatches では幾何学的対応性能が高く、SuperPoint キーポイントを用いた場合、幾何学監督手法と同等程度である。
時間的タスクでは、ビデオ特有のトレーニングを行わずに DAVIS-2017 および JHMDB で最先端または高い成果を達成し、豊かなクロスドメイン一般化を示す。
編集伝搬の実験では、DIFT ベースのマッチングが OpenCLIP よりも跨画像編集の精度を高めることが示されている。

Figure 2 : Given a Stable Diffusion generated image, we extract its intermediate layer activations at a certain time step $t$ during its backward process, and use them as the feature map to predict the corresponding points. Although simple, this method produces correct correspondences on generated i

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。