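[Paper Digest] A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
RECAP relabels training captions with a fine-tuned image-to-text model so that text-to-image models can be trained on higher-quality captions, which significantly improves image fidelity and semantic alignment.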
Text-to-image diffusion models achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web, and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g. FID 14.84 vs. the baseline of 17.87, and 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment, e.g. semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
Motivation and Objectives
- Show that low caption quality in open web datasets limits text-to-image (T2I) models.
- Propose a recaptioning pipeline to relabel training data with an automatic I2T model.
- Demonstrate improved image quality and semantic alignment when training on recaptioned data.
Proposed Method
- Fine-tune PaLI on a small human-caption set to generate detailed RECAP captions.
- Recaption 10M images from LAION-2B-en improved, producing RECAP Short, RECAP Long, and RECAP Mix captions.
- Fine-tune Stable Diffusion v1.4 on the recaptioned dataset with a 50/50 mix of RECAP Short and Long captions (RECAP Mix).
- Evaluate using automated metrics (FID, O-FID, SOA-C, SOA-I, CA, PA, RP) and human studies.
- Compare against the Baseline (SD v1.4) and Alttext (trained on the original captions) models.
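The 50/50 RECAP Mix described in the list above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names `recap_short` and `recap_long`, the helper `build_recap_mix`, and the choice of an independent per-image coin flip are all assumptions made for the example. The digest only states that the mix is 50/50.

```python
import random


def build_recap_mix(examples, short_prob=0.5, seed=0):
    """Pick one caption per image: short with probability short_prob, else long.

    `examples` is a list of dicts with hypothetical keys
    "image_id", "recap_short", and "recap_long".
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    mixed = []
    for ex in examples:
        use_short = rng.random() < short_prob
        mixed.append({
            "image_id": ex["image_id"],
            "caption": ex["recap_short"] if use_short else ex["recap_long"],
            "source": "short" if use_short else "long",
        })
    return mixed


# Toy stand-in data; real captions would come from the fine-tuned captioner.
examples = [
    {
        "image_id": i,
        "recap_short": f"short caption {i}",
        "recap_long": f"long, detailed caption {i}",
    }
    for i in range(1000)
]

mix = build_recap_mix(examples)
short_share = sum(m["source"] == "short" for m in mix) / len(mix)
print(f"share of short captions: {short_share:.2f}")
```

A per-image draw keeps every image in the training set exactly once. A fixed half-and-half partition would be an equally plausible reading of "50/50".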

Experimental Results
Research Questions
- RQ1: Does relabeling training captions with a specialized captioning model improve T2I model performance across fidelity and semantics?
- RQ2: How do short vs. long captions, and their mix, affect image quality and semantic alignment?
- RQ3: What is the impact of caption quality on train-inference skew and sample efficiency?
- RQ4: Which model components (UNet vs. CLIP weights) benefit most from RECAP captions?
Key Findings
| Model | FID | O-FID | SOA-C | SOA-I | CA | PA | RP |
|---|---|---|---|---|---|---|---|
| Baseline | 17.87 | 8.19 | 78.90 | 80.80 | 1.44 | 57.60 | 92.78 |
| Alttext | 17.53 | 8.90 | 78.99 | 80.85 | 1.47 | 57.40 | 91.32 |
| RECAP | 14.84 | 6.23 | 84.34 | 86.17 | 1.32 | 62.42 | 93.80 |
| Real Images | 2.62 | 0.00 | 90.02 | 91.19 | 1.05 | 100.0 | 83.54 |
- RECAP yields substantially better image quality (FID 17.87→14.84) and higher semantic fidelity (SOA-C 78.90→84.34, SOA-I 80.80→86.17).
- RECAP achieves improved counting and positional alignment (CA and PA) and higher CLIP-based prompt alignment (RP).
- Human evaluation shows a 64.3% improvement in successful image generation on MS-COCO and 41.7% on DrawBench for RECAP vs. Baseline; Alttext shows minimal gains.
- Mixing RECAP Short and Long captions (RECAP Mix) provides the best overall performance, combining faster FID gains with semantic improvements.
- Training CLIP and UNet weights with RECAP Mix yields larger semantic gains than training either alone.
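To make the size of these gains concrete, the sketch below computes the absolute and relative change from Baseline to RECAP for each metric in the table. The numbers are copied from the table. The set of lower-is-better metrics (FID, O-FID, CA) is an assumption inferred from the Real Images row and from the abstract describing CA as counting alignment errors. The helper name `compare` is hypothetical.

```python
# Metric values copied from the Baseline and RECAP rows of the results table.
ROWS = {
    "Baseline": {"FID": 17.87, "O-FID": 8.19, "SOA-C": 78.90, "SOA-I": 80.80,
                 "CA": 1.44, "PA": 57.60, "RP": 92.78},
    "RECAP": {"FID": 14.84, "O-FID": 6.23, "SOA-C": 84.34, "SOA-I": 86.17,
              "CA": 1.32, "PA": 62.42, "RP": 93.80},
}

# Assumption: these metrics improve as they decrease; all others as they increase.
LOWER_IS_BETTER = {"FID", "O-FID", "CA"}


def compare(base, new):
    """Return {metric: (absolute delta, relative change in %, improved?)}."""
    result = {}
    for metric, b in base.items():
        delta = new[metric] - b
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        result[metric] = (round(delta, 2), round(100.0 * delta / b, 1), improved)
    return result


changes = compare(ROWS["Baseline"], ROWS["RECAP"])
for metric, (delta, rel, improved) in changes.items():
    print(f"{metric:6s} delta={delta:+.2f} ({rel:+.1f}%) improved={improved}")
```

Relative change is measured against the Baseline value, so a negative percentage on FID, O-FID, or CA indicates an improvement.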

This digest was generated by AI and reviewed by a human editor.