QUICK REVIEW

[Paper Digest] Closing the Modality Gap Aligns Group-Wise Semantics

Eleonora Grassucci, Giordano Cicchetti|arXiv (Cornell University)|Jan 26, 2026
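Domain Adaptation and Few-Shot Learning | Cited by 0
One-Sentence Summary

This paper shows that narrowing the modality gap in multimodal contrastive learning improves group-level tasks such as clustering without hurting instance-level retrieval, by introducing the Align True Pairs and Centroid Uniformity losses. The method reduces cross-modal distances and tightens semantic clusters for both bimodal and trimodal data.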

ABSTRACT

In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general $n$-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.

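Research Motivation and Objectives

  • Motivate and quantify the modality gap in multimodal models beyond retrieval tasks.
  • Show that reducing the gap strengthens group-level semantics such as clustering.
  • Propose a simple, scalable objective that closes the gap between two or more modalities without changing the architecture.
  • Demonstrate empirical gains on bimodal and trimodal benchmarks while preserving instance-level performance.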

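Proposed Method

  • Formalize InfoNCE-based contrastive learning and define the modality gap using centroid distance and true-pair cosine metrics.
  • Introduce Align True Pairs (L_ATP) to minimize the distance from each modality to a common anchor.
  • Introduce Centroid Uniformity (L_CU) to encourage a uniform distribution of modality centroids and avoid collapse.
  • Combine L_gap = L_ATP + L_CU with the standard bidirectional contrastive loss to obtain L_CL_gap.
  • Extend the method from two modalities to more than two without changing the architecture.
  • Show that L_CL_gap drives the modality gap toward zero while preserving true-pair alignment and improving group-level structure.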
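
To make the composition of the objective concrete, here is a minimal NumPy sketch of a bidirectional InfoNCE term combined with an Align True Pairs term and a Centroid Uniformity term. This digest does not give the exact formulas, so the forms below are assumptions: `align_true_pairs` uses the mean squared distance between matched embeddings (treating modality `b` as the common anchor), `centroid_uniformity` applies a log-mean-exp uniformity penalty to per-pair centroids, and the three terms are summed with equal weight. All function names and the `temperature` and `t` defaults are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit sphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(a, b, temperature=0.07):
    """Bidirectional InfoNCE over N matched, unit-norm pairs (a[i], b[i])."""
    logits = a @ b.T / temperature  # (N, N) scaled cosine similarities

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def align_true_pairs(a, b):
    """ASSUMED form of L_ATP: mean squared distance between matched pairs."""
    return np.mean(np.sum((a - b) ** 2, axis=1))

def centroid_uniformity(a, b, t=2.0):
    """ASSUMED form of L_CU: uniformity penalty on per-pair centroids."""
    c = l2_normalize(0.5 * (a + b))  # one centroid per matched pair
    sq_dist = np.sum((c[:, None, :] - c[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(c), k=1)  # each distinct pair once
    return np.log(np.mean(np.exp(-t * sq_dist[upper])))

def cl_gap_loss(a, b, temperature=0.07):
    """L_CL_gap = contrastive + L_ATP + L_CU (equal weights are an assumption)."""
    return (info_nce(a, b, temperature)
            + align_true_pairs(a, b)
            + centroid_uniformity(a, b))
```

Under these assumptions, the alignment term is zero when matched embeddings coincide, and the uniformity term is what keeps the centroids from collapsing to a single point. A real training setup would use an autodiff framework rather than plain NumPy.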
Figure 1: Reducing the gap consistently improves clustering metrics, while leaving unaffected retrieval ones. On the contrary, increasing the gap downgrades the V-Measure, bringing no improvements in R@1. In CLIP, the gap results in very poor clustering performance due to the latent space fragmentat

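Experimental Results

Research Questions

  • RQ1: Does reducing the modality gap improve clustering metrics (e.g., V-Measure) across multiple modalities more markedly than retrieval metrics?
  • RQ2: Can a simple objective that combines true-pair alignment with centroid uniformity shrink the gap without harming instance-level performance?
  • RQ3: Does the proposed gap-reduction method extend to trimodal and larger multimodal settings?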

Key Findings

Method | Dataset | Gap ↓ | TV R@1 | TA R@1 | V-Measure | kNN
CLIP (LT) | CIFAR10 (2 modal) | 0.86 | 82.0 | - | 67.0 | 81.2
CLIP (FT) | CIFAR10 (2 modal) | 0.14 | 82.1 | - | 67.6 | 81.9
Ours | CIFAR10 (2 modal) | 0.09 | 82.4 | - | 67.9 | 82.4
CLIP (LT) | MSCOCO (2 modal) | 0.47 | 74.6 | - | 12.98 | 26.3
CLIP (FT) | MSCOCO (2 modal) | 0.12 | 73.2 | - | 12.99 | 31.0
Ours | MSCOCO (2 modal) | 0.03 | 70.3 | - | 23.63 | 36.4
CLIP (LT) | AV-MNIST (3 modal) | 0.20 | 87.1 | 84.2 | 77.6 | 87.0
CLIP (FT) | AV-MNIST (3 modal) | 0.24 | 84.1 | 80.4 | 73.8 | 85.0
Ours | AV-MNIST (3 modal) | 0.09 | 88.7 | 89.1 | 82.7 | 89.2

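  • Reducing the gap consistently improves clustering metrics (V-Measure, kNN) on CIFAR10, MSCOCO, and AV-MNIST.
  • Retrieval metrics (R@1 for TV/TA) are preserved or only slightly affected as the gap shrinks.
  • The proposed method reduces the modality gap on MSCOCO and AV-MNIST to near zero while markedly increasing the cosine similarity of true pairs.
  • On both bimodal and trimodal benchmarks, a smaller gap yields better group-level semantics without weakening instance-level retrieval.
  • The method achieves a nearly zero centroid gap and produces more balanced, semantically coherent multimodal representations, as shown by the latent-space visualizations and tabulated results.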
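
The two gap indicators named in this digest, centroid distance and true-pair cosine similarity, can be sketched as follows. This is an illustrative NumPy version under stated assumptions: embeddings are given as two `(N, D)` arrays whose rows are matched pairs, and the centroid distance is taken as Euclidean. Whether the Gap column reported above is computed in exactly this way is not confirmed by this digest, and both function names are hypothetical.

```python
import numpy as np

def modality_gap(a, b):
    """Euclidean distance between the centroids of two modalities.

    a, b: (N, D) arrays of embeddings, one row per sample.
    """
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def mean_true_pair_cosine(a, b):
    """Average cosine similarity between matched rows a[i] and b[i]."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a_unit * b_unit, axis=1)))
```

A small centroid distance together with a high true-pair cosine would indicate that the modalities occupy the same region of the latent space and that matched samples sit close to each other within it.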
Figure 2: AV-MNIST multimodal latent space. The CLIP-based learning creates a fragmented latent space with embeddings clearly clustered by modality and not by multimodal semantics. Our method closes the gap and enhances group-wise semantics, placing embeddings of the same class in the same portion o

This digest was generated by AI and reviewed by human editors.