QUICK REVIEW

[Paper Review] Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry

Yong-Hyun Park, Mingi Kwon | arXiv (Cornell University) | 2023-07-24

Advanced Neuroimaging Techniques and Applications | Citations: 13

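One-Line Summary

This paper uses the pullback metric to analyze the latent space of diffusion models, derives a local latent basis, enables $\mathbf{x}$-space editing at a single timestep, and studies how the geometry changes across timesteps and prompts.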

ABSTRACT

Despite the success of diffusion models (DMs), we still lack a thorough understanding of their latent space. To understand the latent space $\mathbf{x}_t \in \mathcal{X}$, we analyze them from a geometrical perspective. Our approach involves deriving the local latent basis within $\mathcal{X}$ by leveraging the pullback metric associated with their encoding feature maps. Remarkably, our discovered local latent basis enables image editing capabilities by moving $\mathbf{x}_t$, the latent space of DMs, along the basis vector at specific timesteps. We further analyze how the geometric structure of DMs evolves over diffusion timesteps and differs across different text conditions. This confirms the known phenomenon of coarse-to-fine generation, as well as reveals novel insights such as the discrepancy between $\mathbf{x}_t$ across timesteps, the effect of dataset complexity, and the time-varying influence of text prompts. To the best of our knowledge, this paper is the first to present image editing through $\mathbf{x}$-space traversal, editing only once at specific timestep $t$ without any additional training, and providing thorough analyses of the latent structure of DMs. The code to reproduce our experiments can be found at https://github.com/enkeejunior1/Diffusion-Pullback.

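Motivation and Goals

Motivates understanding the latent space of diffusion models beyond noise prediction alone.
Introduces a Riemannian-geometry framework that defines a local latent basis in $\mathcal{X}$ through the pullback metric.
Demonstrates image editing by moving $\mathbf{x}_t$ along the discovered basis vectors at a fixed timestep.
Analyzes how the latent geometry changes over diffusion timesteps and under different text prompts.
Shows that editing is possible with a single-timestep manipulation and no additional training.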

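Proposed Method

Defines the pullback metric on $\mathcal{X}$ from the Jacobian $J_{\mathbf{x}}$ of the map from $\mathcal{X}$ to $\mathcal{H}$ and the Euclidean structure of the feature space $\mathcal{H}$ (where $\mathcal{H}$ is the bottleneck of the U-Net).
Computes the local latent basis $\{\mathbf{v}_i\}$ as the top right singular vectors of $J_{\mathbf{x}}$, using SVD or the power method.
Edits the latent $\mathbf{x}$ with x-space guidance: perturbs $\mathbf{x}$ along a basis vector and applies the resulting difference in the noise predictor's output, $\tilde{\mathbf{x}}_{XG} = \mathbf{x} + \gamma[\epsilon_\theta(\mathbf{x}+\mathbf{v}) - \epsilon_\theta(\mathbf{x})]$.
Applies parallel transport in $\mathcal{H}$ to carry local basis vectors between the tangent spaces of different samples $\mathbf{x}$, enabling cross-sample editing.
Performs DDIM inversion followed by generation so that edits require no additional training.
Optionally conditions the basis on a text prompt to obtain semantically meaningful editing directions.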
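
The first three steps above can be illustrated on a toy problem. Under the pullback metric, a direction $\mathbf{v}$ in $\mathcal{X}$ has squared length $\mathbf{v}^\top J_{\mathbf{x}}^\top J_{\mathbf{x}} \mathbf{v}$, so the right singular vectors of $J_{\mathbf{x}}$ with the largest singular values are the directions that change the feature $\mathbf{h}$ the most. The sketch below is only a minimal illustration, not the authors' implementation: `encode` and `eps_theta` are hypothetical stand-ins for the U-Net encoder and noise predictor, the dimensions are arbitrary, and the Jacobian is taken by finite differences rather than the power method a real model would likely need.

```python
import numpy as np

rng = np.random.default_rng(0)
D_X, D_H = 12, 5                               # toy dimensions (assumptions)
W = 0.3 * rng.normal(size=(D_H, D_X))          # hypothetical encoder weights
A = 0.3 * rng.normal(size=(D_X, D_X))          # hypothetical noise-predictor weights

def encode(x):
    """Toy stand-in for the map x_t -> h (the U-Net bottleneck feature)."""
    return np.tanh(W @ x)

def eps_theta(x):
    """Toy stand-in for the noise predictor epsilon_theta."""
    return np.tanh(A @ x)

def jacobian(f, x, step=1e-5):
    """Central-difference Jacobian of f at x, shape (dim_out, dim_in)."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = step
        cols.append((f(x + e) - f(x - e)) / (2 * step))
    return np.stack(cols, axis=1)

def local_basis(x, k):
    """Top-k right singular vectors of J_x (as rows), their singular values, and J_x."""
    J = jacobian(encode, x)
    _, s, vt = np.linalg.svd(J, full_matrices=False)
    return vt[:k], s[:k], J

def x_space_guidance(x, v, gamma):
    """x + gamma * [eps_theta(x + v) - eps_theta(x)], as in the formula above."""
    return x + gamma * (eps_theta(x + v) - eps_theta(x))

x = rng.normal(size=D_X)
V, S, J = local_basis(x, k=3)

# Pullback squared length of each basis vector; equals the squared singular value.
pullback_norms = np.array([v @ (J.T @ J) @ v for v in V])

# One editing step along the leading basis vector (gamma chosen arbitrarily).
x_edit = x_space_guidance(x, V[0], gamma=1.0)
```

For a real diffusion model the Jacobian is far too large to materialize, which is presumably why the method mentions the power method; an autodiff Jacobian-vector product would replace the `jacobian` helper in that setting.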

Figure 1: Conceptual illustration of local geometric structure. (a) The local basis $\{\mathbf{v}_{1},\mathbf{v}_{2},\cdots\}$ of the local latent subspace $\mathcal{T}_{{\mathbf{x}}_{t}}$ within the latent space $\mathcal{X}$ is interlinked with the local basis $\{\mathbf{u}_{1},\mathbf{u}_{2},\cdo

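Experimental Results

Research Questions

RQ1: Can the latent space $\mathcal{X}$ of diffusion models be given a meaningful local geometric structure?
RQ2: Does the local latent basis found through pullback geometry enable semantically meaningful image editing without additional training?
RQ3: How does the latent structure change over diffusion timesteps, and how does it vary with dataset complexity and prompts?
RQ4: To what extent can editing directions be transferred between samples through parallel transport in the feature space?
RQ5: How does text conditioning affect the geometry of the latent space and its tangent spaces?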

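Key Findings

A local latent basis of $\mathcal{X}$ can be found by using the pullback metric induced by the Jacobian between $\mathcal{X}$ and $\mathcal{H}$.
Moving along the basis vectors at a specific timestep yields semantically meaningful image edits without additional training.
During generation the latent basis shifts from low-frequency to high-frequency components, confirming coarse-to-fine behavior.
As diffusion progresses, the tangent spaces of different samples diverge further, to a degree that depends on dataset complexity.
Similar prompts induce similar tangent spaces, and the influence of the prompt diminishes in later timesteps.
Parallel transport in $\mathcal{H}$ allows editing directions to be transferred between samples when their tangent spaces are sufficiently aligned.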
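
Several of these findings rest on comparing tangent spaces, for example how far apart the local bases of two samples or two prompts are. The review does not say which measure was used, so the sketch below assumes one standard choice: the principal angles between two subspaces and the resulting geodesic distance on the Grassmannian. The function names and the random test subspaces are hypothetical and chosen only for illustration.

```python
import numpy as np

def principal_angles(V1, V2):
    """Principal angles between the row spaces of V1 and V2.

    V1, V2: arrays of shape (k, d) with orthonormal rows.
    The singular values of V1 @ V2.T are the cosines of the angles.
    """
    sigma = np.linalg.svd(V1 @ V2.T, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))

def geodesic_distance(V1, V2):
    """Grassmannian geodesic distance: the 2-norm of the principal angles."""
    return np.linalg.norm(principal_angles(V1, V2))

def orthonormal_rows(M):
    """Orthonormalize the rows of M (shape (k, d), k <= d) via QR."""
    q, _ = np.linalg.qr(M.T)   # columns of q are orthonormal
    return q.T

rng = np.random.default_rng(0)
base = orthonormal_rows(rng.normal(size=(3, 10)))                  # reference subspace
near = orthonormal_rows(base + 0.05 * rng.normal(size=(3, 10)))    # small perturbation
far = orthonormal_rows(rng.normal(size=(3, 10)))                   # unrelated subspace

d_same = geodesic_distance(base, base)
d_near = geodesic_distance(base, near)
d_far = geodesic_distance(base, far)
```

In this toy setup a slightly perturbed subspace should sit much closer to the reference than an unrelated one, which is the kind of comparison implied by statements such as "similar prompts induce similar tangent spaces".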

Figure 2: Image editing with the discovered latent basis. (a) Schematic depiction of our image editing procedure. ① An input image is subjected to DDIM inversion, resulting in an initial noisy sample $\mathbf{x}_{T}$ . ② The sample $\mathbf{x}_{T}$ is progressively denoised until reaching the point

This review was generated by AI and reviewed by a human editor.