QUICK REVIEW

[Paper Review] Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models

Fei Shen, Ye Hu|arXiv (Cornell University)|Oct 10, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 18

One-Line Summary

PCDMs introduce a three-stage progressive diffusion framework for pose-guided person image synthesis that bridges the gap between the source and target poses, improving realism and consistency.

ABSTRACT

Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis. However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge. This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages. Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance. Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image. In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency. The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image. Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios. The code and model will be available at https://github.com/tencent-ailab/PCDMs.

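Motivation and Objectives

Motivated by the need to improve pose-guided person image synthesis when the source and target poses differ.
Proposes a three-stage diffusion framework that progressively aligns appearance, pose, and texture.
Leverages global features, dense correspondence, and texture refinement to improve photorealism.
Demonstrates superior quantitative and qualitative results on public datasets and evaluates downstream tasks.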

Proposed Method

Stage 1: A prior conditional diffusion model predicts the global features of the target image from the pose coordinates and the source image, using a transformer and CLIP-based embeddings.
Stage 2: An inpainting conditional diffusion model uses the predicted global features to establish dense source-target correspondences and generate a coarse-grained person image.
Stage 3: A refining conditional diffusion model takes the coarse image as a condition and restores textures and fine details via texture-guided diffusion with cross-attention.
Classifier-free guidance is employed to balance fidelity and diversity; latent diffusion and CLIP-based embeddings are leveraged.
The three-stage progression turns unaligned image-to-image generation into aligned, high-quality synthesis.
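The classifier-free guidance step mentioned above can be illustrated with a minimal sketch. It uses one common parameterization, in which the guided noise estimate extrapolates from the unconditional prediction toward the conditional one. The function name, the plain-list representation, and the guidance scale values are assumptions for illustration; this review does not state the exact formulation or scale used by PCDMs.

```python
from typing import List


def classifier_free_guidance(
    eps_uncond: List[float],
    eps_cond: List[float],
    scale: float,
) -> List[float]:
    """Blend unconditional and conditional noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 pushes further toward the condition
    (typically higher fidelity to the condition, lower diversity).
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # eps_hat = eps_uncond + scale * (eps_cond - eps_uncond), element-wise
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values, not model outputs).
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
guided = classifier_free_guidance(uncond, cond, scale=2.0)
```

In a real sampler the two predictions would come from the same denoising network, evaluated once with the condition and once with it dropped, at every denoising step.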

Figure 1: (a) Existing methods typically utilize unaligned image-to-image generation at the conditional level. (b) Our approach progressively predicts the global features, dense correspondences, and texture restoration of target image, enabling image synthesis.
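The progressive flow of the three stages can be sketched as plain function composition. This is only a data-flow illustration: the function and type names are hypothetical, the stages are passed in as callables rather than implemented, and the exact inputs of each stage (for example, whether the refining stage also receives the source image) are assumptions not confirmed by this review.

```python
from typing import Callable, Tuple

# Placeholder types: real models would use image tensors and keypoint arrays.
Image = str
Pose = str
Feature = str


def run_three_stage_pipeline(
    source_image: Image,
    source_pose: Pose,
    target_pose: Pose,
    prior: Callable[[Image, Pose, Pose], Feature],
    inpaint: Callable[[Image, Pose, Pose, Feature], Image],
    refine: Callable[[Image, Image], Image],
) -> Tuple[Feature, Image, Image]:
    """Chain the three stages and return every intermediate result."""
    # Stage 1: predict the global features of the target image.
    global_features = prior(source_image, source_pose, target_pose)
    # Stage 2: generate a coarse-grained image conditioned on those features.
    coarse_image = inpaint(source_image, source_pose, target_pose, global_features)
    # Stage 3: restore texture and fine detail, conditioned on the coarse image.
    # Passing the source image here is an assumption of this sketch.
    refined_image = refine(source_image, coarse_image)
    return global_features, coarse_image, refined_image


# Toy stand-ins that only record what each stage received.
def toy_prior(src: Image, src_pose: Pose, tgt_pose: Pose) -> Feature:
    return f"global[{src}|{src_pose}->{tgt_pose}]"


def toy_inpaint(src: Image, src_pose: Pose, tgt_pose: Pose, feat: Feature) -> Image:
    return f"coarse[{feat}]"


def toy_refine(src: Image, coarse: Image) -> Image:
    return f"refined[{coarse}]"


outputs = run_three_stage_pipeline(
    "img_src", "pose_src", "pose_tgt", toy_prior, toy_inpaint, toy_refine
)
```

Each stage consumes the output of the previous one, which is the sense in which the framework narrows the source-target gap step by step instead of in a single unaligned generation pass.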

Experimental Results

Research Questions

RQ1: Can a three-stage progressive diffusion framework effectively bridge the gap between source and target poses for person image synthesis?
RQ2: Do global feature prediction, dense correspondence, and texture refinement yield improvements over single-stage approaches?
RQ3: How do PCDMs perform on standard benchmarks and in downstream tasks like person re-identification?

Key Findings

Dataset	Methods	SSIM (↑)	LPIPS (↓)	FID (↓)
DeepFashion (256×176)	Def-GAN	0.6786	0.2330	18.457
DeepFashion (256×176)	PATN	0.6709	0.2562	20.751
DeepFashion (256×176)	ADGAN	0.6721	0.2283	14.458
DeepFashion (256×176)	PISE	0.6629	0.2059	13.610
DeepFashion (256×176)	GFLA	0.7074	0.2341	10.573
DeepFashion (256×176)	DPTN	0.7112	0.1931	11.387
DeepFashion (256×176)	CASD	0.7248	0.1936	11.373
DeepFashion (256×176)	NTED	0.7182	0.1752	8.6838
DeepFashion (256×176)	PIDM	0.7312	0.1678	6.3671
DeepFashion (256×176)	PCDMs (Ours)	0.7444	0.1365	7.4734
DeepFashion (512×352)	CocosNet2	0.7236	0.2265	13.325
DeepFashion (512×352)	NTED	0.7376	0.1980	7.7821
DeepFashion (512×352)	PIDM	0.7419	0.1768	5.8365
DeepFashion (512×352)	PCDMs (Ours)	0.7601	0.1475	7.5519
Market-1501	Def-GAN	0.2683	0.2994	25.364
Market-1501	PATN	0.2821	0.3196	22.657
Market-1501	GFLA	0.2883	0.2817	19.751
Market-1501	DPTN	0.2854	0.2711	18.995
Market-1501	PIDM	0.3054	0.2415	14.451
Market-1501	PCDMs (Ours)	0.3169	0.2238	13.897

PCDMs achieve the highest SSIM and the lowest LPIPS among the compared methods on both DeepFashion resolutions and on Market-1501.
On DeepFashion 256×176, PCDMs achieve SSIM 0.7444, LPIPS 0.1365, and FID 7.4734: the best SSIM and LPIPS in the table, with FID second to PIDM (6.3671).
On DeepFashion 512×352, PCDMs achieve SSIM 0.7601, LPIPS 0.1475, and FID 7.5519: again the best SSIM and LPIPS, with FID behind PIDM (5.8365) but ahead of NTED (7.7821).
On Market-1501, PCDMs achieve SSIM 0.3169, LPIPS 0.2238, and FID 13.897, the best value on all three metrics.
User studies indicate favorable real-image misclassification rates and a preference for PCDMs outputs.
Applying the refining conditional diffusion model to the outputs of other SOTA methods also improves their results, suggesting the stage generalizes beyond PCDMs.
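To make the comparison with the strongest baseline concrete, the short script below computes the relative change of PCDMs against PIDM for each metric, using only values copied from the table above. The helper name and the sign convention (positive means PCDMs are better) are choices made for this sketch, not something defined in the paper.

```python
# (SSIM, LPIPS, FID) values copied from the results table.
RESULTS = {
    "DeepFashion 256x176": {
        "PIDM": (0.7312, 0.1678, 6.3671),
        "PCDMs": (0.7444, 0.1365, 7.4734),
    },
    "DeepFashion 512x352": {
        "PIDM": (0.7419, 0.1768, 5.8365),
        "PCDMs": (0.7601, 0.1475, 7.5519),
    },
    "Market-1501": {
        "PIDM": (0.3054, 0.2415, 14.451),
        "PCDMs": (0.3169, 0.2238, 13.897),
    },
}

# SSIM is higher-is-better; LPIPS and FID are lower-is-better.
METRICS = (("SSIM", True), ("LPIPS", False), ("FID", False))


def relative_improvement(ours: float, baseline: float, higher_is_better: bool) -> float:
    """Relative change in percent; positive means `ours` is better."""
    change = (ours - baseline) / baseline * 100.0
    return change if higher_is_better else -change


def compare(dataset: str) -> dict:
    """Per-metric relative improvement of PCDMs over PIDM on one dataset."""
    ours = RESULTS[dataset]["PCDMs"]
    base = RESULTS[dataset]["PIDM"]
    return {
        name: relative_improvement(o, b, higher)
        for (name, higher), o, b in zip(METRICS, ours, base)
    }


for dataset in RESULTS:
    summary = ", ".join(f"{k}: {v:+.1f}%" for k, v in compare(dataset).items())
    print(f"{dataset}: {summary}")
```

The signs make the pattern in the table explicit: SSIM and LPIPS favor PCDMs on every dataset, while FID favors PIDM on both DeepFashion resolutions and PCDMs on Market-1501.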

Figure 2: The three-stage pipeline of Progressive Conditional Diffusion Models (PCDMs) progressively operates to generate the final high-quality and high-fidelity synthesized image.

This review was generated by AI and checked by a human editor.