QUICK REVIEW

[Paper Review] Closing the Modality Gap Aligns Group-Wise Semantics

Eleonora Grassucci, Giordano Cicchetti|arXiv (Cornell University)|Jan 26, 2026

Domain Adaptation and Few-Shot Learning | Cited by 0

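One-Sentence Summary

This paper shows that narrowing the modality gap in multimodal contrastive learning improves group-level tasks such as clustering without harming instance-level retrieval. It does so by introducing two losses, Align True Pairs and Centroid Uniformity, which reduce cross-modal distance and tighten semantic clusters in both bimodal and trimodal data.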

ABSTRACT

In multimodal learning, CLIP has been recognized as the \textit{de facto} method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general $n$-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.

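Motivation and Objectives

Motivate and quantify the effect of the modality gap in multimodal models beyond retrieval tasks.
Show that reducing the gap strengthens group-level semantics such as clustering.
Propose a simple, scalable objective that closes the gap between two or more modalities without changing the architecture.
Demonstrate empirical gains on bimodal and trimodal benchmarks while preserving instance-level performance.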

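Proposed Method

Formulate contrastive learning on the basis of InfoNCE, and define the modality gap using centroid distance and true-pair cosine metrics.
Introduce Align True Pairs (L_ATP) to minimize the distance between modalities with respect to a common anchor.
Introduce Centroid Uniformity (L_CU) to encourage a uniform distribution of modality centroids and prevent collapse.
Combine L_gap = L_ATP + L_CU with the standard bidirectional contrastive loss to obtain L_CL_gap.
Extend the method from the bimodal to the multimodal case (more than two modalities) without changing the architecture.
Show that L_CL_gap drives the modality gap toward zero while keeping true pairs aligned and improving group-level structure.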
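
The quantities named in this list can be illustrated with a minimal, standard-library Python sketch. The symmetric InfoNCE term follows the usual definition, and the two gap measures are the centroid distance and the mean true-pair cosine. The exact formulas for L_ATP and L_CU are not given in this review, so the forms below (mean squared distance between matched pairs, and a log-mean Gaussian kernel over per-pair centroids) are assumptions, as are the unweighted sum in `cl_gap_loss`, the temperature values, the toy data, and every function name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def centroid_gap(xs, ys):
    """Gap measure 1: Euclidean distance between the two modality centroids."""
    return math.sqrt(sq_dist(centroid(xs), centroid(ys)))

def mean_true_pair_cosine(xs, ys):
    """Gap measure 2: mean cosine of matched pairs (inputs must be unit-norm)."""
    return sum(dot(x, y) for x, y in zip(xs, ys)) / len(xs)

def info_nce(xs, ys, tau=0.07):
    """Symmetric (bidirectional) InfoNCE over unit-norm embeddings."""
    sim = [[dot(x, y) / tau for y in ys] for x in xs]

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denominator - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(columns))

def align_true_pairs(xs, ys):
    """ASSUMED form of L_ATP: mean squared distance between matched pairs."""
    return sum(sq_dist(x, y) for x, y in zip(xs, ys)) / len(xs)

def centroid_uniformity(xs, ys, t=2.0):
    """ASSUMED form of L_CU: log of the mean Gaussian kernel between the
    centroids of different true pairs (lower means more spread out)."""
    cs = [centroid([x, y]) for x, y in zip(xs, ys)]
    kernels = [math.exp(-t * sq_dist(cs[i], cs[j]))
               for i in range(len(cs)) for j in range(len(cs)) if i != j]
    return math.log(sum(kernels) / len(kernels))

def cl_gap_loss(xs, ys):
    """ASSUMED unweighted sum of the contrastive term and L_gap."""
    return info_nce(xs, ys) + align_true_pairs(xs, ys) + centroid_uniformity(xs, ys)

# Hypothetical toy data: text embeddings are the image embeddings pushed
# along a third axis, which mimics a constant offset between modalities.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
texts_gap = [normalize([1.0, 0.0, 1.0]), normalize([0.0, 1.0, 1.0])]

print("centroid gap:", centroid_gap(images, texts_gap))
print("true-pair cosine:", mean_true_pair_cosine(images, texts_gap))
print("loss with gap:", cl_gap_loss(images, texts_gap))
print("loss without gap:", cl_gap_loss(images, images))
```

In this toy setup the contrastive term alone is already close to zero for the shifted texts, because every true pair still ranks first. The assumed gap terms are what separate the shifted configuration from the perfectly aligned one.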

Figure 1: Reducing the gap consistently improves clustering metrics, while leaving unaffected retrieval ones. On the contrary, increasing the gap downgrades the V-Measure, bringing no improvements in R@1. In CLIP, the gap results in very poor clustering performance due to the latent space fragmentat

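Experimental Results

Research Questions

RQ1: Does reducing the modality gap improve clustering metrics (such as V-Measure) more markedly than retrieval metrics across multiple modalities?
RQ2: Can a simple objective that combines true-pair alignment with centroid uniformity shrink the gap without harming instance-level performance?
RQ3: Does the proposed gap-reduction method extend to trimodal and larger multimodal settings?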
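
Since RQ1 relies on V-Measure, a short sketch of how that metric behaves may help. The code below implements the standard definition (the harmonic mean of homogeneity and completeness, with equal weighting) using only the Python standard library. Whether the paper uses exactly this weighting or a particular clustering algorithm is not stated in this review, and the labels in the example are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(targets, given):
    """H(targets | given), estimated from label co-occurrence counts."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_classes = entropy(classes)
    h_clusters = entropy(clusters)
    homogeneity = (1.0 if h_classes == 0
                   else 1.0 - conditional_entropy(classes, clusters) / h_classes)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - conditional_entropy(clusters, classes) / h_clusters)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Hypothetical example: four embeddings, ordered as
# (image, class 0), (image, class 1), (text, class 0), (text, class 1).
classes = [0, 1, 0, 1]
clusters_by_modality = [0, 0, 1, 1]   # clusters follow the modality
clusters_by_class = [1, 0, 1, 0]      # clusters follow the semantics

print(v_measure(classes, clusters_by_modality))
print(v_measure(classes, clusters_by_class))
```

A clustering that splits points by modality carries no information about the class labels, so it scores at the bottom of the scale, while a clustering that follows the classes scores at the top regardless of how the cluster ids are named. This is the mechanism behind the fragmentation effect described in the Figure 1 caption.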

Key Findings

Method	Dataset	Gap ↓	TV R@1	TA R@1	V-Measure	kNN
CLIP (LT)	CIFAR10 (2 modal)	0.86	82.0	-	67.0	81.2
CLIP (FT)	CIFAR10 (2 modal)	0.14	82.1	-	67.6	81.9
Ours	CIFAR10 (2 modal)	0.09	82.4	-	67.9	82.4
CLIP (LT)	MSCOCO (2 modal)	0.47	74.6	-	12.98	26.3
CLIP (FT)	MSCOCO (2 modal)	0.12	73.2	-	12.99	31.0
Ours	MSCOCO (2 modal)	0.03	70.3	-	23.63	36.4
CLIP (LT)	AV-MNIST (3 modal)	0.20	87.1	84.2	77.6	87.0
CLIP (FT)	AV-MNIST (3 modal)	0.24	84.1	80.4	73.8	85.0
Ours	AV-MNIST (3 modal)	0.09	88.7	89.1	82.7	89.2

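Reducing the gap consistently improves clustering metrics (V-Measure, kNN) on CIFAR10, MSCOCO, and AV-MNIST.
Retrieval metrics (TV/TA R@1) are preserved or only slightly affected as the gap shrinks; the largest drop in the table is MSCOCO TV R@1, from 74.6 to 70.3.
The proposed method reduces the modality gap on MSCOCO and AV-MNIST to near zero (0.03 and 0.09) while markedly increasing the cosine similarity of true pairs.
On both bimodal and trimodal benchmarks, a smaller gap yields better group-level semantics without weakening instance-level retrieval.
The method achieves a nearly zero centroid gap and produces more balanced, semantically coherent multimodal representations, as shown by the latent-space visualizations and the tabulated results.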
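
One way to see why retrieval can be insensitive to the gap is a toy construction, sketched below with the Python standard library. It is not the paper's experiment: the embeddings, the constant offset, and the function names are all hypothetical. Each text embedding is its image embedding shifted along an axis orthogonal to the image data, so the modality centroids are far apart, yet every text still ranks its own image first under cosine similarity.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_1(queries, gallery):
    """Fraction of queries whose most similar gallery item has the same index."""
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(gallery)), key=lambda j: cosine(query, gallery[j]))
        hits += int(best == i)
    return hits / len(queries)

def centroid_distance(xs, ys):
    """Euclidean distance between the mean embeddings of two modalities."""
    cx = [sum(col) / len(xs) for col in zip(*xs)]
    cy = [sum(col) / len(ys) for col in zip(*ys)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))

# Hypothetical image embeddings in the z = 0 plane.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
# Hypothetical text embeddings: same points, shifted by a constant offset in z.
offset = [0.0, 0.0, 1.0]
texts = [[a + b for a, b in zip(x, offset)] for x in images]

r1 = recall_at_1(texts, images)        # text-to-image retrieval
gap = centroid_distance(images, texts)
print("R@1:", r1, "centroid gap:", gap)
```

Because the offset is shared by all texts and orthogonal to the image embeddings, it rescales every similarity in a row by the same factor and leaves the ranking untouched. A clustering algorithm run on the joint set of points would still see two modality-specific groups, which is consistent with the contrast between retrieval and clustering metrics reported above.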

Figure 2: AV-MNIST multimodal latent space. The CLIP-based learning creates a fragmented latent space with embeddings clearly clustered by modality and not by multimodal semantics. Our method closes the gap and enhances group-wise semantics, placing embeddings of the same class in the same portion o

This review was generated by AI and checked by a human editor.