QUICK REVIEW

[Paper Review] Infrared and Visible Image Fusion with Language-Driven Loss in CLIP Embedding Space

Yuhao Wang, Lingjuan Miao|arXiv (Cornell University)|Feb 26, 2024

Advanced Image Fusion Techniques | Cited by 13

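One-Sentence Summary

Introduces a CLIP-based, language-oriented objective for infrared-visible image fusion: the fusion output is aligned in CLIP embedding space with a fusion model expressed in language, which improves fusion quality without ground-truth supervision.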

ABSTRACT

Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly-complementary properties of the two image modalities. Due to the lack of ground-truth fused images, the fusion output of current deep-learning based methods heavily depends on the loss functions defined mathematically. As it is hard to well mathematically define the fused image without ground truth, the performance of existing fusion methods is limited. In this paper, we propose to use natural language to express the objective of IVIF, which can avoid the explicit mathematical modeling of fusion output in current losses, and make full use of the advantage of language expression to improve the fusion performance. For this purpose, we present a comprehensive language-expressed fusion objective, and encode relevant texts into the multi-modal embedding space using CLIP. A language-driven fusion model is then constructed in the embedding space, by establishing the relationship among the embedded vectors representing the fusion objective and input image modalities. Finally, a language-driven loss is derived to make the actual IVIF aligned with the embedded language-driven fusion model via supervised training. Experiments show that our method can obtain much better fusion results than existing techniques. The code is available at https://github.com/wyhlaowang/LDFusion.

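Motivation and Objectives

Express the objective of infrared-visible image fusion (IVIF) in natural language, avoiding explicit mathematical loss design.
Use CLIP to encode the input modalities and the fusion objective into a shared embedding space.
Propose a language-driven fusion model and a corresponding loss that keep the actual fusion consistent with the objective described in language.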

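Proposed Method

Encode the infrared and visible inputs with the CLIP image encoder to obtain embedding vectors.
Encode language prompts describing the inputs and the desired fusion with the CLIP text encoder, forming the language-driven fusion model.
Define a language-driven fusion loss that aligns the input-to-target transition by requiring the vector increments (ΔV) of the two modalities to be parallel in the embedding space.
Add a multi-scale, patch-based version of the fusion direction loss (L_d^†) for local guidance.
Add a regularization term that prevents the fused embedding from collapsing onto the source embeddings (Φ).
Introduce a feature-fidelity loss (L_v) based on VGG-19 features to preserve content and suppress unwanted content.
Train a three-component fusion network, consisting of a two-branch encoder, cross-fusion attention, and a decoder, to generate the fused image.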
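
The direction-alignment idea in the steps above can be illustrated with a minimal sketch. It is one plausible reading of the summary, not the authors' implementation: the shift from a source image embedding to the fused image embedding is encouraged to be parallel to the shift from the source prompt embedding to the fusion prompt embedding, measured as 1 minus cosine similarity. The function names, the dictionary keys, and the toy 3-dimensional vectors are all hypothetical; in practice the vectors would come from the CLIP image and text encoders. The patch-based variant, the regularizer, and the VGG-19 term are not covered here.

```python
import math


def sub(a, b):
    """Element-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def direction_loss(img_src, img_fused, txt_src, txt_fused):
    """1 - cos between the image-side shift and the text-side shift.

    Assumed form: 0 when the two shifts are parallel, 2 when opposite.
    """
    delta_img = sub(img_fused, img_src)  # source image -> fused image
    delta_txt = sub(txt_fused, txt_src)  # source prompt -> fusion prompt
    return 1.0 - cosine(delta_img, delta_txt)


def language_driven_loss(emb):
    """Sum of the direction losses for the infrared and visible branches."""
    ir = direction_loss(emb["img_ir"], emb["img_fused"],
                        emb["txt_ir"], emb["txt_fused"])
    vis = direction_loss(emb["img_vis"], emb["img_fused"],
                         emb["txt_vis"], emb["txt_fused"])
    return ir + vis


# Toy stand-ins for CLIP embeddings (hypothetical values, 3-D for readability).
emb = {
    "img_ir": [1.0, 0.0, 0.0],
    "img_vis": [0.0, 1.0, 0.0],
    "img_fused": [1.0, 1.0, 0.0],
    "txt_ir": [2.0, 0.0, 0.0],
    "txt_vis": [0.0, 2.0, 0.0],
    "txt_fused": [2.0, 2.0, 0.0],
}
print(language_driven_loss(emb))
```

In this toy case the image-side and text-side shifts point the same way for both modalities, so the loss is close to zero. A fused embedding that moved against the text-side shift would push each term toward 2.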

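Experimental Results

Research Questions

RQ1: Can a language-expressed objective in CLIP space guide infrared-visible image fusion without ground-truth fused images?
RQ2: Does aligning the actual fusion transition with the language-driven embedding model improve fusion quality across datasets and metrics?
RQ3: How do cross-fusion attention and the language-driven loss affect the preservation of salient targets and background details?
RQ4: How does the proposed method compare with state-of-the-art IVIF methods under standard fusion metrics?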

Key Findings

Metric	FusionGAN	MFEIF	PIAFusion	PMGI	RFN	SwinFusion	U2Fusion	UMF	GANMcC	Ours
EN	6.550	6.749	6.929	7.058	7.086	6.908	7.035	6.629	6.791	7.335
AG	3.069	3.685	6.029	4.616	3.066	5.801	6.430	4.113	3.395	9.878
SD	30.487	33.827	41.400	38.707	40.224	39.735	37.894	30.817	34.162	51.502
SF	3.922	4.345	6.291	5.232	3.837	6.166	6.787	4.674	4.082	8.365
VIFF	0.265	0.376	0.405	0.593	0.575	0.451	0.699	0.359	0.433	0.751
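
To make the size of the gap in the table above easier to read, the following sketch computes, for each metric, the strongest baseline and the relative margin of the proposed method over it. The numbers are copied from the table. The helper name `margin_over_best_baseline` is hypothetical, and the sketch assumes that higher is better for all five metrics, which is the usual convention for EN, AG, SD, SF, and VIFF.

```python
BASELINES = ["FusionGAN", "MFEIF", "PIAFusion", "PMGI", "RFN",
             "SwinFusion", "U2Fusion", "UMF", "GANMcC"]

# One row per metric; the last entry in each row is the proposed method ("Ours").
TABLE = {
    "EN":   [6.550, 6.749, 6.929, 7.058, 7.086, 6.908, 7.035, 6.629, 6.791, 7.335],
    "AG":   [3.069, 3.685, 6.029, 4.616, 3.066, 5.801, 6.430, 4.113, 3.395, 9.878],
    "SD":   [30.487, 33.827, 41.400, 38.707, 40.224, 39.735, 37.894, 30.817, 34.162, 51.502],
    "SF":   [3.922, 4.345, 6.291, 5.232, 3.837, 6.166, 6.787, 4.674, 4.082, 8.365],
    "VIFF": [0.265, 0.376, 0.405, 0.593, 0.575, 0.451, 0.699, 0.359, 0.433, 0.751],
}


def margin_over_best_baseline(values):
    """Return (best baseline name, its score, relative gain of Ours in percent)."""
    *baseline_scores, ours = values
    best_score, best_name = max(zip(baseline_scores, BASELINES))
    gain_pct = 100.0 * (ours - best_score) / best_score
    return best_name, best_score, gain_pct


for metric, values in TABLE.items():
    name, score, gain = margin_over_best_baseline(values)
    print(f"{metric}: best baseline {name} ({score:.3f}), "
          f"ours {values[-1]:.3f}, gain {gain:.1f}%")
```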

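Achieves better fusion quality than 9 SOTA methods on the TNO and RoadScene datasets across the EN, AG, SD, SF, and VIFF metrics.
The language-driven loss (LDL) significantly outperforms the ablation without LDL in visual perception, contrast, and detail preservation.
Cross-fusion attention (CFA) improves local fusion of multi-modal information, giving better edge fidelity and background structure.
Fusion results remain robust under low-light and nighttime conditions, with better target saliency and background detail.
The quantitative results in Table 1 show that the proposed method achieves the best EN, AG, SD, SF, and VIFF on the evaluation datasets.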
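
For readers unfamiliar with the reference-free metrics named above, here is a minimal sketch of four of them on a grayscale image stored as a list of rows of integers in 0..255. It uses common textbook definitions: EN as the Shannon entropy of the intensity histogram, SD as the standard deviation, SF as spatial frequency, and AG as average gradient. Normalization conventions vary between toolboxes, so these are assumptions and may not match the evaluation code used in the paper. VIFF is omitted because it needs the source images and a multi-scale model.

```python
import math
from collections import Counter


def entropy(img):
    """EN: Shannon entropy (bits) of the pixel-intensity histogram."""
    pixels = [p for row in img for p in row]
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def std_dev(img):
    """SD: population standard deviation of pixel intensities."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def spatial_frequency(img):
    """SF: sqrt(RF^2 + CF^2) from horizontal and vertical differences."""
    h, w = len(img), len(img[0])
    rf_sq = sum((img[i][j] - img[i][j - 1]) ** 2
                for i in range(h) for j in range(1, w)) / (h * w)
    cf_sq = sum((img[i][j] - img[i - 1][j]) ** 2
                for i in range(1, h) for j in range(w)) / (h * w)
    return math.sqrt(rf_sq + cf_sq)


def average_gradient(img):
    """AG: mean of sqrt((gx^2 + gy^2) / 2) using forward differences."""
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(h - 1):
        for j in range(w - 1):
            gx = img[i][j + 1] - img[i][j]
            gy = img[i + 1][j] - img[i][j]
            total += math.sqrt((gx * gx + gy * gy) / 2)
    return total / ((h - 1) * (w - 1))


# A flat image carries no detail; a checkerboard is maximally contrasted.
flat = [[128, 128], [128, 128]]
checker = [[0, 255], [255, 0]]
for name, img in (("flat", flat), ("checker", checker)):
    print(name, entropy(img), std_dev(img),
          spatial_frequency(img), average_gradient(img))
```

Higher values on all four metrics indicate more information, contrast, or edge detail, which is why a fused image with sharper texture scores higher than a smooth one.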


This review was generated by AI and reviewed by human editors.