QUICK REVIEW

[Paper Review] L-CAD: Language-based Colorization with Any-level Descriptions using Diffusion Priors

Zheng Chang, Shuchen Weng|arXiv (Cornell University)|May 24, 2023

Human Motion and Animation | Cited by 9

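ONE-SENTENCE SUMMARY

L-CAD uses a pretrained cross-modality diffusion model to colorize images from natural language descriptions at any level of detail, with modules that preserve spatial structure, prevent ghosting, and assign colors in an instance-aware way.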

ABSTRACT

Language-based colorization produces plausible and visually pleasing colors under the guidance of user-friendly natural language descriptions. Previous methods implicitly assume that users provide comprehensive color descriptions for most of the objects in the image, which leads to suboptimal performance. In this paper, we propose a unified model to perform language-based colorization with any-level descriptions. We leverage the pretrained cross-modality generative model for its robust language understanding and rich color priors to handle the inherent ambiguity of any-level descriptions. We further design modules to align with input conditions to preserve local spatial structures and prevent the ghosting effect. With the proposed novel sampling strategy, our model achieves instance-aware colorization in diverse and complex scenarios. Extensive experimental results demonstrate our advantages of effectively handling any-level descriptions and outperforming both language-based and automatic colorization methods. The code and pretrained models are available at: https://github.com/changzheng123/L-CAD.

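MOTIVATION AND GOALS

Colorize effectively from complete, partial, or scarce language descriptions.
Use Stable Diffusion's language understanding and color priors to handle ambiguity in the descriptions.
Preserve local spatial structures and prevent color ghosting during cross-modality decoding.
Provide instance-aware color assignment for complex scenes containing multiple objects.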

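PROPOSED METHOD

Adopts Stable Diffusion as the backbone to exploit its cross-modality priors and language understanding.
Introduces a luminance-guided image compression module to preserve the grayscale spatial structure at the decoding stage.
Replaces ordinary convolutions in the downsampling modules with Channel-Extended Convolution (CEC) blocks to align latent features with the input description.
Conditions on any-level descriptions in latent space using CLIP-based text encoding, and avoids color ghosting through latent-space alignment.
Implements an instance-aware sampling strategy that uses referring-segmentation estimates to assign colors to regions through progressive attention guidance.
Trains in two stages: first in pixel space with any-level descriptions, then diffusion fine-tuning in latent space with the pretrained weights frozen.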
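
One common way to feed an extra conditioning signal (for example, grayscale features) into a pretrained convolution is to widen its input channels and zero-initialise the new weights, so that the layer initially behaves exactly like the original. The sketch below illustrates that general idea with a 1x1 convolution in plain Python. It is not taken from the L-CAD code: whether the paper's CEC block is initialised this way is an assumption, and the names `conv1x1` and `extend_in_channels` are hypothetical.

```python
def conv1x1(weights, pixel):
    """Apply a 1x1 convolution at a single spatial location.

    weights: one row per output channel, one weight per input channel.
    pixel:   input-channel values at that location.
    """
    if any(len(row) != len(pixel) for row in weights):
        raise ValueError("weight rows must match the number of input channels")
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


def extend_in_channels(weights, extra):
    """Return a copy of `weights` that accepts `extra` more input channels.

    Assumption: the new weights start at zero, so the extended layer
    reproduces the pretrained output until fine-tuning updates them.
    """
    return [list(row) + [0.0] * extra for row in weights]


# Hypothetical pretrained layer: 2 input channels, 2 output channels.
pretrained = [[0.5, -1.0], [2.0, 0.25]]
latent = [4.0, 2.0]   # original input channels
gray = [0.7]          # extra conditioning channel

extended = extend_in_channels(pretrained, extra=1)
before = conv1x1(pretrained, latent)
after = conv1x1(extended, latent + gray)
```

Because the added weights are zero, `before` and `after` are identical here; the extra channel only starts to influence the output once training moves those weights.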

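EXPERIMENTAL RESULTS

Research questions

RQ1: How does language-based colorization handle descriptions ranging from complete to scarce levels of detail?
RQ2: Can a diffusion prior be guided so that colors align with the grayscale spatial structure and color ghosting is avoided?
RQ3: How effective is instance-aware sampling at assigning colors to the corresponding objects in complex scenes?
RQ4: How do luminance-guided compression and the semantic-aligned latent representation affect colorization quality?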

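Key findings

L-CAD achieves state-of-the-art language-based colorization performance (complete/partial descriptions) on the extended COCO-Stuff and multi-instance datasets.
On the evaluation datasets, L-CAD outperforms both language-based and automatic colorization methods on PSNR, SSIM, and LPIPS.
A user study shows that L-CAD receives higher correspondence and realism ratings than the baselines on both datasets.
Ablation studies confirm that luminance-guided compression, the semantic-aligned latent representation, and instance-aware sampling each contribute to colorization quality.
Under scarce-description conditions on ImageNet, L-CAD reaches competitive FID, PSNR, SSIM, and LPIPS, indicating robust automatic colorization even with minimal guidance.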
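
Of the metrics listed above, PSNR is simple enough to compute directly. The following is a minimal sketch that assumes 8-bit pixel values stored in flat lists; the function name `psnr` and the sample values are illustrative, and the paper's exact evaluation protocol (color space, resolution) is not specified in this review.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up pixel values, for illustration only.
ground_truth = [52, 120, 200, 33]
colorized = [50, 125, 190, 33]
score = psnr(ground_truth, colorized)
```

Higher PSNR means the colorized output is closer to the ground truth pixel by pixel. SSIM and LPIPS complement it by measuring structural and perceptual similarity, respectively.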


This review was generated by AI and checked by human editors.