[Paper Review] Unsupervised Person Image Generation with Semantic Parsing Transformation
This paper proposes an unsupervised person image generation framework that decomposes pose-guided image synthesis into two stages: semantic parsing transformation and appearance generation. By training these components end-to-end with cycle consistency and a semantic-aware style loss, the method preserves clothing attributes and improves body shape fidelity, outperforming prior unsupervised methods on DeepFashion and Market-1501, especially in attribute retention and structural consistency.
In this paper, we address unsupervised pose-guided person image generation, which is known to be challenging due to non-rigid deformation. Unlike previous methods learning a rock-hard direct mapping between human bodies, we propose a new pathway to decompose the hard mapping into two more accessible subtasks, namely, semantic parsing transformation and appearance generation. Firstly, a semantic generative network is proposed to transform between semantic parsing maps, in order to simplify the non-rigid deformation learning. Secondly, an appearance generative network learns to synthesize semantic-aware textures. Thirdly, we demonstrate that training our framework in an end-to-end manner further refines the semantic maps and final results accordingly. Our method is generalizable to other semantic-aware person image generation tasks, e.g., clothing texture transfer and controlled image manipulation. Experimental results demonstrate the superiority of our method on DeepFashion and Market-1501 datasets, especially in keeping the clothing attributes and better body shapes.
Motivation & Objective
- To address the challenge of unsupervised, pose-guided person image generation without paired training data.
- To overcome difficulties in modeling non-rigid human body deformation and preserving clothing attributes in image synthesis.
- To reduce the complexity of direct image-to-image mapping by decomposing it into semantic parsing transformation and appearance generation.
- To enable generalization to downstream tasks such as clothing texture transfer and controlled image manipulation.
- To improve semantic map prediction quality through end-to-end training, refining both parsing and final image outputs.
Proposed method
- The framework decomposes person image generation into two modules: semantic parsing transformation and appearance generation.
- A semantic generative network performs pose-conditioned transformation between source and target parsing maps, simplifying non-rigid deformation learning.
- An appearance generative network synthesizes photo-realistic textures on the transformed parsing map using a semantic-aware style loss.
- Pseudo-labels and cycle consistency are used to train the semantic generator without paired supervision.
- A semantic-aware style loss ensures that texture mapping respects semantic regions, preserving attributes like sleeve length and fabric patterns.
- End-to-end training jointly optimizes both modules, enabling refinement of predicted semantic maps and improved image quality.
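The cycle-consistency idea used to train the semantic generator without paired data can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the parsing map is a flat list of labels, a "pose" is reduced to an integer offset, and `toy_semantic_generator`, `lossy_generator`, and `label_mismatch` are hypothetical names introduced only for this example. The point is the structure of the loss: transform the source parsing map to the target pose, transform it back, and penalize any difference from the original.

```python
from typing import Callable, List

ParsingMap = List[int]  # flattened per-pixel semantic labels (toy representation)
Generator = Callable[[ParsingMap, int, int], ParsingMap]


def toy_semantic_generator(parsing: ParsingMap, src_pose: int, tgt_pose: int) -> ParsingMap:
    """Hypothetical stand-in for a semantic generator.

    Assumption: a 'pose' is just an integer offset, so changing pose
    is a circular shift of the label sequence.
    """
    if not parsing:
        return []
    shift = (tgt_pose - src_pose) % len(parsing)
    if shift == 0:
        return list(parsing)
    return parsing[-shift:] + parsing[:-shift]


def lossy_generator(parsing: ParsingMap, src_pose: int, tgt_pose: int) -> ParsingMap:
    """A deliberately flawed generator that erases label 2 (maps it to 0)."""
    shifted = toy_semantic_generator(parsing, src_pose, tgt_pose)
    return [0 if label == 2 else label for label in shifted]


def label_mismatch(a: ParsingMap, b: ParsingMap) -> float:
    """Fraction of positions whose labels differ (0.0 for empty maps)."""
    if len(a) != len(b):
        raise ValueError("parsing maps must have the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(gen: Generator, parsing: ParsingMap,
                           src_pose: int, tgt_pose: int) -> float:
    """Map source -> target pose -> back to source pose, then compare."""
    forward = gen(parsing, src_pose, tgt_pose)
    back = gen(forward, tgt_pose, src_pose)
    return label_mismatch(parsing, back)


source_parsing = [0, 1, 1, 2, 0]
perfect_loss = cycle_consistency_loss(toy_semantic_generator, source_parsing, 0, 2)
flawed_loss = cycle_consistency_loss(lossy_generator, source_parsing, 0, 2)
```

A generator that preserves every region round-trips exactly and gets zero loss, while one that drops a semantic region is penalized even though no ground-truth target parsing map was ever provided. In a real system the generators would be neural networks and the mismatch would be a differentiable loss.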
Experimental results
Research questions
- RQ1: Can unsupervised person image generation be improved by decoupling the complex image-to-image mapping into semantic parsing transformation and appearance synthesis?
- RQ2: How can semantic parsing transformation reduce the difficulty of modeling non-rigid human body deformations in image generation?
- RQ3: To what extent can end-to-end training refine semantic map predictions and enhance final image quality in the absence of paired supervision?
- RQ4: Can the proposed framework generalize to other conditional image generation tasks such as clothing texture transfer and layout-controlled image manipulation?
- RQ5: What role does a semantic-aware style loss play in preserving clothing attributes during appearance generation?
Key findings
- The end-to-end training strategy significantly improves semantic map prediction, leading to better body shape and clothing attribute preservation compared to two-stage training.
- On the DeepFashion dataset, the end-to-end model achieves performance comparable to a two-stage baseline using ground-truth parsing maps.
- On the Market-1501 dataset, the end-to-end model outperforms even the two-stage baseline using ground-truth parsing maps, due to better handling of low-resolution parsing errors.
- The semantic-aware style loss is critical for preserving fine-grained clothing attributes; replacing it with mask-style or patch-style losses leads to distorted contours and artifacts.
- The face adversarial loss effectively improves the realism of generated faces, enhancing overall visual quality.
- The appearance generator enables successful clothing texture transfer and controlled image manipulation by modifying semantic maps, demonstrating the framework's versatility.
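To make the finding about the semantic-aware style loss more concrete, here is a minimal sketch of one plausible reading of it: compare Gram matrices of feature maps separately inside each semantic region, rather than over the whole image. The exact formulation in the paper may differ. The feature layout (`features[c][p]` for channel `c` at flattened pixel `p`), the normalization by region size, and the names `masked_gram` and `semantic_aware_style_loss` are assumptions made for this example.

```python
from typing import List, Sequence

Features = List[List[float]]  # features[c][p]: channel c at flattened pixel p


def masked_gram(features: Features, mask: Sequence[int]) -> List[List[float]]:
    """Gram matrix over pixels where mask is nonzero, normalized by region size."""
    pixels = [p for p, m in enumerate(mask) if m]
    n = max(len(pixels), 1)  # avoid division by zero for empty regions
    channels = len(features)
    return [
        [sum(features[i][p] * features[j][p] for p in pixels) / n
         for j in range(channels)]
        for i in range(channels)
    ]


def semantic_aware_style_loss(feat_gen: Features, parsing_gen: Sequence[int],
                              feat_ref: Features, parsing_ref: Sequence[int],
                              labels: Sequence[int]) -> float:
    """Sum of squared Gram differences, computed per semantic label (assumed form)."""
    loss = 0.0
    for label in labels:
        gram_gen = masked_gram(feat_gen, [int(l == label) for l in parsing_gen])
        gram_ref = masked_gram(feat_ref, [int(l == label) for l in parsing_ref])
        loss += sum((a - b) ** 2
                    for row_a, row_b in zip(gram_gen, gram_ref)
                    for a, b in zip(row_a, row_b))
    return loss


# Reference: region 1 activates channel 0, region 2 activates channel 1.
feat_ref = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]
parsing_ref = [1, 1, 2, 2]

# Generated image whose regions moved but kept their own textures.
feat_gen = [[0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 0.0, 0.0]]
moved_loss = semantic_aware_style_loss(feat_gen, [2, 2, 1, 1],
                                       feat_ref, parsing_ref, labels=[1, 2])

# Same features, but the regions did not move: textures are swapped.
swapped_loss = semantic_aware_style_loss(feat_gen, [1, 1, 2, 2],
                                         feat_ref, parsing_ref, labels=[1, 2])

# A single Gram matrix over the whole image cannot tell the two cases apart.
all_pixels = [1, 1, 1, 1]
global_grams_equal = masked_gram(feat_gen, all_pixels) == masked_gram(feat_ref, all_pixels)
```

In this toy case the per-region loss is zero when each region keeps its own texture after the pose change and positive when textures land in the wrong region, while a whole-image Gram comparison sees no difference. That is the intuition behind tying the style statistics to semantic regions.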
This review was created by AI and reviewed by human editors.