QUICK REVIEW

[Paper Review] Unsupervised Semantic Correspondence Using Stable Diffusion

Eric Hedlin, Gopal Sharma | arXiv (Cornell University) | 2023-05-24

Generative Adversarial Networks and Image Synthesis | Citations: 22

One-Line Summary

This paper presents an unsupervised method for semantic correspondence built on Stable Diffusion; it achieves competitive PCK scores on several datasets and improves on previous unsupervised baselines.

ABSTRACT

Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences - locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative for the SPair-71k dataset) any existing weakly or unsupervised method on PF-Willow, CUB-200 and SPair-71k datasets.

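Motivation and Objectives

Motivate unsupervised semantic correspondence through generative diffusion models.
Propose a Stable Diffusion-guided embedding optimization approach that aligns semantic parts across images.
Evaluate against supervised and weakly supervised baselines on standard benchmarks (CUB-200, PF-Willow, SPair-71k).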

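Proposed Method

Optimize the prompt embeddings of Stable Diffusion, without any training of the model itself, so that attention is maximized on the region of interest in the source image.
Transfer the optimized embeddings to another image and use the diffusion model's token-level attention maps to locate the corresponding region.
Compare against strongly supervised, weakly supervised, and unsupervised baselines across datasets using the PCK metric.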
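
The optimize-then-transfer idea can be illustrated with a toy NumPy sketch. Everything in it is an assumption made for illustration: random unit vectors stand in for the diffusion model's image features, a single softmax over spatial locations stands in for the cross-attention map, and a cross-entropy loss stands in for whatever objective the paper actually uses. The helper names `optimize_embedding` and `transfer` are hypothetical. This is not the authors' implementation and it does not call Stable Diffusion.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()

def optimize_embedding(feats, loc, steps=200, lr=1.0):
    """Fit one embedding so attention over locations peaks at `loc`.

    Minimizes -log softmax(feats @ emb)[loc] by plain gradient descent
    (a stand-in objective, assumed for this sketch).
    """
    emb = np.zeros(feats.shape[1])
    onehot = np.zeros(feats.shape[0])
    onehot[loc] = 1.0
    for _ in range(steps):
        attn = softmax(feats @ emb)
        grad = feats.T @ (attn - onehot)  # gradient of the cross-entropy loss
        emb -= lr * grad
    return emb

def transfer(feats, emb):
    """Return the location with the highest attention score in another image."""
    return int(np.argmax(feats @ emb))

rng = np.random.default_rng(0)
n_loc, dim = 64, 64

# Toy "source image": one unit-norm feature vector per spatial location.
src = rng.normal(size=(n_loc, dim))
src /= np.linalg.norm(src, axis=1, keepdims=True)

# Toy "target image": the same features, shuffled and slightly perturbed.
perm = rng.permutation(n_loc)
tgt = src[perm] + 0.01 * rng.normal(size=(n_loc, dim))

query = 10                                    # source location of interest
emb = optimize_embedding(src, query)          # optimize on the source image
pred = transfer(tgt, emb)                     # reuse the embedding on the target
expected = int(np.where(perm == query)[0][0]) # where the query feature moved to

print("predicted:", pred, "expected:", expected)
```

Because the toy target is a shuffled, slightly noisy copy of the source, the embedding fitted on the source location should peak at the matching shuffled position in the target.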

Figure 4 : Attention maps for each of the tokens corresponding to the sentence "A picture of a cat"
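
Per-token attention maps of this kind can be sketched with standard scaled dot-product cross-attention, where each pixel distributes its attention over the prompt tokens. This is a minimal sketch under stated assumptions: random arrays stand in for the pixel queries and token keys that a real diffusion U-Net would produce, and `token_attention_maps` is a hypothetical helper, not part of any library.

```python
import numpy as np

def token_attention_maps(pixel_queries, token_keys, height, width):
    """Return one (height, width) attention map per token.

    pixel_queries: (height * width, d) array, one query per pixel.
    token_keys:    (num_tokens, d) array, one key per prompt token.
    """
    d = pixel_queries.shape[1]
    logits = pixel_queries @ token_keys.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over tokens
    return weights.T.reshape(token_keys.shape[0], height, width)

rng = np.random.default_rng(0)
h, w, d = 8, 8, 16
tokens = ["A", "picture", "of", "a", "cat"]

queries = rng.normal(size=(h * w, d))    # placeholder pixel features
keys = rng.normal(size=(len(tokens), d)) # placeholder token features

maps = token_attention_maps(queries, keys, h, w)
for name, attn_map in zip(tokens, maps):
    print(name, attn_map.shape)
```

Since the softmax runs over tokens, the maps sum to one at every pixel; a token's map is bright where that token wins the most attention.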

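Experimental Results

Research Questions

RQ1: How well can unsupervised semantic correspondence be achieved using diffusion-based embeddings, without explicit supervision?
RQ2: Do diffusion-based representations deliver competitive PCK performance on standard benchmarks compared with existing unsupervised and weakly supervised methods?
RQ3: How do token-level attention and optimized embeddings affect correspondence accuracy?
RQ4: How does the proposed method rank against baselines such as DINO+MLS, VGG+MLS, and PWarpC-NC-Net on CUB-200, PF-Willow, and SPair-71k?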

Key Results

Our method achieves 61.6 PCK@0.05 and 77.5 PCK@0.1 on CUB-200.
Our method achieves 53.0 PCK@0.05 and 84.3 PCK@0.1 on PF-Willow.
Our method achieves 28.9 PCK@0.05 and 45.4 PCK@0.1 on SPair-71k.
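Compared with prior unsupervised baselines such as DINO+NN, our method reports higher PCK on the considered datasets; the abstract cites a 20.9% relative improvement on SPair-71k over existing weakly supervised and unsupervised methods.
Performance is competitive across datasets, and the abstract describes the PF-Willow results as on par with the strongly supervised state of the art.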
PWarpC-NC-Net and other baselines exhibit varying strengths; our method consistently ranks above several unsupervised baselines.
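
For readers unfamiliar with the PCK@α scores quoted above, the metric can be sketched as follows. This assumes the common definition in which a predicted keypoint counts as correct when it lies within α times the larger side of a reference box of its ground-truth position. Whether that box is the full image or the object bounding box varies by benchmark and is not stated in this review, so `size` is left as a parameter. The function name `pck` and the sample keypoints are made up for illustration.

```python
import numpy as np

def pck(pred, gt, size, alpha=0.1):
    """Fraction of keypoints within alpha * max(size) of the ground truth.

    pred, gt: sequences of (x, y) keypoints in the same order.
    size:     (height, width) of the reference box (assumed normalization).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)  # per-keypoint pixel error
    threshold = alpha * max(size)
    return float(np.mean(dist <= threshold))

# Made-up keypoints on a 100 x 100 reference box.
gt_points = [(10, 10), (50, 40), (80, 90), (30, 70)]
pred_points = [(12, 11), (58, 40), (80, 96), (60, 70)]
box = (100, 100)

print("PCK@0.10:", pck(pred_points, gt_points, box, alpha=0.10))
print("PCK@0.05:", pck(pred_points, gt_points, box, alpha=0.05))
```

A tighter α shrinks the tolerance radius, which is why PCK@0.05 is never higher than PCK@0.1 for the same predictions.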

Figure 5 : Attention maps for each of the tokens corresponding to the sentence "A bird’s left eye"

This review was generated by AI and reviewed by a human editor.