QUICK REVIEW

[Paper Review] CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Dongzhi Jiang, Guanglu Song|arXiv (Cornell University)|Apr 4, 2024

Image Retrieval and Classification Techniques | Cited by 6

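One-Sentence Summary

CoMat fine-tunes a diffusion model with an image-to-text concept matching mechanism, using an image captioning model together with an attribute concentration module that improves attribute binding, and achieves state-of-the-art text-to-image alignment without requiring any image-text pairs.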

ABSTRACT

Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL in two text-to-image alignment benchmarks and achieves state-of-the-art performance.

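Motivation and Goals

Investigate and diagnose the misalignment between text prompts and generated images in diffusion-based T2I models.
Propose an end-to-end fine-tuning framework that uses image-to-text concept matching to rebalance token attention.
Improve attribute binding through an entity attribute concentration module.
Preserve the original generation capability while steering alignment, by means of a fidelity preservation mechanism.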

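Proposed Method

Generate images from text prompts with the diffusion model, and use a frozen image captioning model to score p(C | image) for the concepts C in the prompt.
Backpropagate through the denoising process to optimize the diffusion model so that missing concepts are activated in the image (concept matching loss).
Extract entities (nouns) and their modifiers from the prompt, obtain region masks with Grounded-SAM, and apply token-level and pixel-level attention losses that align nouns and modifiers with their image regions (attribute concentration).
Add an adversarial fidelity preservation loss, using a discriminator initialized from the pre-trained diffusion model, to avoid overfitting to the caption-based reward and to maintain generation quality.
Train end to end with the joint objective L = L_cap + alpha * L_token + beta * L_pixel + lambda * L_adv, which requires no images or human preference data.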
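
The joint objective above can be illustrated with a small, framework-free Python sketch. It is not the authors' implementation. The function names, the negative log-likelihood form of L_cap, the "attention mass outside the mask" form of L_token, and all numeric values (including the weights) are assumptions chosen for illustration; L_pixel and L_adv are passed in as precomputed scalars because this summary does not specify their form.

```python
import math


def caption_loss(token_log_probs):
    """Concept matching loss, assumed here to be a negative log-likelihood.

    token_log_probs: log-probabilities that a frozen captioning model assigns
    to each prompt token given the generated image (assumed precomputed).
    """
    return -sum(token_log_probs)


def token_attention_loss(attn, mask):
    """Assumed token-level loss: share of attention falling outside the mask.

    attn: 2D list of non-negative cross-attention values for one token.
    mask: 2D list of 0/1 values marking the entity region (same shape).
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 1.0  # no attention at all is treated as fully misplaced
    inside = sum(a * m
                 for a_row, m_row in zip(attn, mask)
                 for a, m in zip(a_row, m_row))
    return 1.0 - inside / total


def joint_loss(l_cap, l_token, l_pixel, l_adv, alpha, beta, lam):
    """L = L_cap + alpha * L_token + beta * L_pixel + lambda * L_adv."""
    return l_cap + alpha * l_token + beta * l_pixel + lam * l_adv


# Toy values, hypothetical and not taken from the paper.
log_probs = [math.log(0.5), math.log(0.25)]   # captioner scores for 2 tokens
attn = [[0.1, 0.3], [0.2, 0.4]]               # attention map of one token
mask = [[0, 1], [0, 1]]                       # entity occupies right column

l_cap = caption_loss(log_probs)
l_token = token_attention_loss(attn, mask)
total = joint_loss(l_cap, l_token, l_pixel=0.2, l_adv=0.1,
                   alpha=2.0, beta=1.0, lam=0.5)
print(l_cap, l_token, total)
```

In this sketch a lower L_cap means the captioner finds the prompt's concepts more likely given the image, and a lower L_token means more of a token's attention lies inside its entity's region; the weights alpha, beta, and lambda trade these terms off against fidelity preservation.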

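Experimental Results

Research Questions

RQ1: Can image-to-text concept matching guide improvements in token attention and thereby reduce misalignment between prompts and generated images?
RQ2: Does enforcing attribute concentration on object regions strengthen attribute binding and overall prompt fidelity?
RQ3: How can fidelity be preserved during reward-driven fine-tuning so that generation quality does not degrade?
RQ4: Is the method effective across different base diffusion models and captioning backbones?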

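Key Findings

CoMat-SDXL achieves state-of-the-art text-to-image alignment on T2I-CompBench, with significant gains over the SDXL baseline in attribute binding and spatial relationships.
CoMat-SD1.5 shows significant improvements over the SD1.5 baseline, including large gains in spatial relationships (over 70% on some metrics).
On TIFA, CoMat-SDXL improves over SDXL by 1.8 points, and CoMat-SD1.5 improves over SD1.5 by 7.3 points.
Ablation studies show that concept matching brings significant gains, and adding attribute concentration brings further improvements in several subcategories.
Using the pre-trained UNet as the discriminator for fidelity preservation achieves the best balance between image fidelity and alignment without degrading generation quality.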

This review was generated by AI and reviewed by human editors.