QUICK REVIEW

[Paper Review] A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Junyi Zhang, Charles Herrmann|arXiv (Cornell University)|May 24, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 51

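ONE-SENTENCE SUMMARY

This paper shows that combining Stable Diffusion features with DINOv2 features yields zero-shot semantic and dense correspondence that outperforms prior methods on SPair-71k, PF-Pascal, and TSS, and that it enables high-quality instance swapping without task-specific training.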

ABSTRACT

Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.

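MOTIVATION AND OBJECTIVES

Explore how Stable Diffusion (SD) features can be used for semantic and dense correspondence across images.
Analyze the complementary strengths and weaknesses of SD features and DINOv2 features on correspondence tasks.
Design a simple fusion strategy for combining SD and DINOv2 features.
Evaluate zero-shot and supervised performance on standard benchmarks (SPair-71k, PF-Pascal, TSS).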

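PROPOSED METHOD

Extract SD features from the Stable Diffusion encoder-decoder U-Net, aggregate them across decoder layers (2, 5, 8), and reduce their dimensionality with PCA.
Use DINOv2 token features from the final layer (layer 11) as semantic descriptors.
Normalize the SD and DINOv2 features and combine them with a fixed weight alpha (alpha = 0.5) to form fused features for nearest-neighbor correspondence.
Evaluate zero-shot nearest-neighbor matching for sparse and dense correspondence on the SPair-71k, PF-Pascal, and TSS datasets.
Demonstrate an instance-swapping pipeline that combines the fused dense correspondences with Stable Diffusion-based image-to-image translation and SD-based refinement.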
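
The fusion and matching steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the SD and DINOv2 features have already been extracted and resampled to the same N spatial locations, that "normalize" means per-location L2 normalization, and that the two halves are weighted by alpha and 1 - alpha before concatenation. The function names `l2_normalize`, `fuse_features`, and `nearest_neighbor_match` are hypothetical.

```python
import numpy as np

def l2_normalize(feats, eps=1e-8):
    """Scale each row (one descriptor per spatial location) to unit L2 norm."""
    return feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)

def fuse_features(sd_feats, dino_feats, alpha=0.5):
    """Concatenate normalized SD and DINOv2 descriptors with a fixed weight.

    sd_feats:   (N, C_sd) array, one SD descriptor per location.
    dino_feats: (N, C_dino) array, one DINOv2 descriptor per location.
    The alpha / (1 - alpha) split is an assumption of this sketch.
    """
    sd = l2_normalize(sd_feats)
    dino = l2_normalize(dino_feats)
    return np.concatenate([alpha * sd, (1.0 - alpha) * dino], axis=-1)

def nearest_neighbor_match(src_feats, tgt_feats):
    """For every source location, return the index of the target location
    whose descriptor has the highest cosine similarity."""
    sim = l2_normalize(src_feats) @ l2_normalize(tgt_feats).T  # (N_src, N_tgt)
    return sim.argmax(axis=1)
```

With alpha = 0.5 both feature types contribute equally to the cosine similarity, so a match has to agree with both the SD and the DINOv2 descriptors to score highly.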

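EXPERIMENTAL RESULTS

Research Questions

RQ1: Can SD features alone provide competitive semantic and dense correspondence across different objects and instances?
RQ2: Do SD and DINOv2 features offer complementary strengths, and does fusing them improve correspondence performance?
RQ3: Does a simple fusion strategy deliver state-of-the-art zero-shot performance on standard dense and semantic correspondence benchmarks?
RQ4: Does the fused representation enable high-quality downstream tasks such as instance swapping without task-specific fine-tuning?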

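Key Findings

Zero-shot fused SD+DINOv2 features achieve a leading average PCK of 64.0 on SPair-71k, outperforming prior methods.
In the zero-shot setting on SPair-71k, fusion improves on the DINOv2 baseline by 8.4 points (from 55.6 to 64.0).
On PF-Pascal, the fused features outperform unsupervised baselines and match or exceed task-specific methods, with consistent gains across thresholds.
On TSS, the fused features outperform unsupervised nearest-neighbor methods, with large gains attributed to the spatial consistency of SD features.
In quantitative evaluation, instance swapping with fused correspondences achieves higher CLIP and perceptual quality scores and a lower FID than using SD or DINOv2 features alone.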
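
For readers unfamiliar with the PCK figures quoted above, the sketch below shows the usual way a Percentage of Correct Keypoints score is computed: a predicted keypoint counts as correct when it lies within a fraction of the larger reference box side from the ground truth. The review does not state which threshold or reference box each benchmark uses, so the default `thresh_ratio=0.1`, the `bbox_hw` argument, and the function name `pck` are assumptions made for illustration.

```python
import numpy as np

def pck(pred_kps, gt_kps, bbox_hw, thresh_ratio=0.1):
    """Percentage of Correct Keypoints, in percent.

    pred_kps, gt_kps: (K, 2) arrays of (x, y) keypoint coordinates.
    bbox_hw: (height, width) of the reference box (assumed convention).
    thresh_ratio: fraction of max(height, width) used as the distance threshold.
    """
    pred = np.asarray(pred_kps, dtype=float)
    gt = np.asarray(gt_kps, dtype=float)
    threshold = thresh_ratio * max(bbox_hw)
    dists = np.linalg.norm(pred - gt, axis=1)  # Euclidean error per keypoint
    return 100.0 * float(np.mean(dists <= threshold))
```

Under this definition, a gain from 55.6 to 64.0 means that roughly 8 more keypoints out of every 100 land within the threshold of their ground-truth location.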

This review was generated by AI and checked by a human editor.