QUICK REVIEW

[Paper Review] $f$-MICL: Understanding and Generalizing InfoNCE-based Contrastive Learning

Yiwei Lu, Guojun Zhang|arXiv (Cornell University)|Feb 15, 2024
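Fuzzy Logic and Control Systems | Cited by 5
One-Sentence Summary

This paper generalizes InfoNCE to mutual information based on $f$-divergences ($f$-MICL), introduces the $f$-Gaussian similarity, and reports empirical gains on vision and language benchmarks across different architectures.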

ABSTRACT

In self-supervised contrastive learning, a widely-adopted objective function is InfoNCE, which uses the heuristic cosine similarity for the representation comparison, and is closely related to maximizing the Kullback-Leibler (KL)-based mutual information. In this paper, we aim at answering two intriguing questions: (1) Can we go beyond the KL-based objective? (2) Besides the popular cosine similarity, can we design a better similarity function? We provide answers to both questions by generalizing the KL-based mutual information to the $f$-Mutual Information in Contrastive Learning ($f$-MICL) using the $f$-divergences. To answer the first question, we provide a wide range of $f$-MICL objectives which share the nice properties of InfoNCE (e.g., alignment and uniformity), and meanwhile result in similar or even superior performance. For the second question, assuming that the joint feature distribution is proportional to the Gaussian kernel, we derive an $f$-Gaussian similarity with better interpretability and empirical performance. Finally, we identify close relationships between the $f$-MICL objective and several popular InfoNCE-based objectives. Using benchmark tasks from both vision and natural language, we empirically evaluate $f$-MICL with different $f$-divergences on various architectures (SimCLR, MoCo, and MoCo v3) and datasets. We observe that $f$-MICL generally outperforms the benchmarks and the best-performing $f$-divergence is task and dataset dependent.

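Research Motivation and Goals

  • Extend contrastive learning from KL-based mutual information (InfoNCE) to the broader $f$-mutual information ($f$-MICL).
  • Investigate whether cosine similarity in the contrastive objective can be replaced by another similarity function that performs better.
  • Develop a tractable $f$-Gaussian similarity under an assumption on the joint feature distribution.
  • Demonstrate the empirical gains of $f$-MICL across datasets, architectures, and modalities.
  • Relate $f$-MICL to existing InfoNCE-based objectives and establish theoretical properties such as alignment and uniformity.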

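Proposed Method

  • Generalize MI to the $f$-MI framework and derive a variational lower bound suited to contrastive learning optimization.
  • Propose the $f$-MICL objective: $\max_{s \in F} \; \mathbb{E}_{(x,y)\sim p_{+}}\, s(g(x),g(y)) - \mathbb{E}_{(x,y)\sim p_{\times}}\, f^{*}\circ s(g(x),g(y))$, where $f^{*}$ is the convex conjugate of $f$.
  • Introduce the $f$-Gaussian similarity $s_f(x^{g},y^{g}) = f'\circ G_{\sigma}(\|x^{g}-y^{g}\|^{2})$ as a tractable similarity measure.
  • Under the assumption that the joint feature density on the unit hypersphere is approximately proportional to a Gaussian kernel, obtain a practical $s_f$ determined by $f$ and the Gaussian prior.
  • Provide an empirical estimate for a batch of size $N$: $\frac{1}{N}\sum_{i} s_f(x_i^{g},y_i^{g}) - \frac{\alpha}{N(N-1)}\sum_{i\neq j} f^{*}\circ s_f(x_i^{g},x_j^{g})$.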
  • Show connections to existing objectives (InfoNCE, AU, Spectral Contrastive Loss) and discuss alignment/uniformity properties.
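The batch estimate listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: it fixes $f$ to the KL generator $f(u) = u\log u$ (so $f'(u) = \log u + 1$ and $f^{*}(t) = e^{t-1}$), uses an unnormalized Gaussian kernel $G_{\sigma}(d^{2}) = \exp(-d^{2}/(2\sigma^{2}))$, and sets $\sigma = \alpha = 1$ by default. All function names are hypothetical, and the random features stand in for encoder outputs.

```python
import numpy as np

def gaussian_kernel(sq_dist, sigma=1.0):
    # Assumed unnormalized kernel: G_sigma(d^2) = exp(-d^2 / (2 sigma^2)).
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# KL generator f(u) = u log u (an illustrative choice of f-divergence).
def f_prime_kl(u):
    return np.log(u) + 1.0      # f'(u)

def f_conj_kl(t):
    return np.exp(t - 1.0)      # convex conjugate f*(t)

def f_gaussian_similarity(sq_dist, sigma=1.0, f_prime=f_prime_kl):
    # s_f = f' composed with G_sigma, applied to squared distances.
    return f_prime(gaussian_kernel(sq_dist, sigma))

def f_micl_objective(x, y, sigma=1.0, alpha=1.0,
                     f_prime=f_prime_kl, f_conj=f_conj_kl):
    """(1/N) sum_i s_f(x_i, y_i) - alpha/(N(N-1)) sum_{i != j} f*(s_f(x_i, x_j))."""
    n = x.shape[0]
    # Positive pairs: row i of x is paired with row i of y.
    pos_sq = np.sum((x - y) ** 2, axis=1)
    pos_term = np.mean(f_gaussian_similarity(pos_sq, sigma, f_prime))
    # Negative pairs: all ordered pairs (i, j) with i != j within x.
    diff = x[:, None, :] - x[None, :, :]
    neg_sq = np.sum(diff ** 2, axis=-1)
    neg_vals = f_conj(f_gaussian_similarity(neg_sq, sigma, f_prime))
    off_diag = ~np.eye(n, dtype=bool)
    neg_term = alpha * neg_vals[off_diag].sum() / (n * (n - 1))
    return pos_term - neg_term

def l2_normalize(z):
    # Project features onto the unit hypersphere.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Toy features standing in for encoder outputs g(x), g(y).
rng = np.random.default_rng(0)
x = l2_normalize(rng.normal(size=(8, 16)))
y = l2_normalize(x + 0.05 * rng.normal(size=(8, 16)))
value = f_micl_objective(x, y)   # quantity to maximize over the encoder
```

Passing a different `f_prime` and `f_conj` pair swaps in another $f$-divergence without touching the rest of the estimate. In training, the same computation would be written in an autodiff framework and its negative minimized.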
Figure 1: Experiment for verifying Assumption 3. Here we draw the relation between the squared distances $\|x^{g}-y^{g}\|^{2}$ and the averaged log likelihood $\log p_{g}$, with $\log p_{g}$ estimated by the flow model RealNVP (Dinh et al., 2017). (left) Gaussian prior; (right) Uniform prior.

Experimental Results

Research Questions

  • RQ1: Can we extend InfoNCE’s KL-based objective to a wider family of f-divergences (f-MICL) without sacrificing performance?
  • RQ2: Is cosine similarity the best choice for measuring similarity in contrastive learning, or can f-Gaussian or other similarity functions improve results?
  • RQ3: What theoretical and empirical properties (like alignment and uniformity) extend from InfoNCE to the f-MICL framework?
  • RQ4: How can the effect of different f-divergences on downstream representations be evaluated across vision and language tasks?

Key Findings

  • f-MICL provides a spectrum of objectives via different f-divergences, often yielding similar or superior performance to InfoNCE.
  • The proposed f-Gaussian similarity consistently outperforms cosine similarity across tested f-divergences.
  • InfoNCE is an upper bound of the f-MICL objective, linking the new framework to existing methods.
  • AU is shown as a special case within the f-MICL framework, illustrating alignment and uniformity properties extend to f-MICL.
  • Different datasets/tasks prefer different f-divergences; no single f-divergence dominates all settings.
  • f-Gaussian similarity improves performance across datasets such as CIFAR-10, STL-10, TinyImageNet, and ImageNet when using MoCo v3 with ViT-S.
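To see why alignment and uniformity carry over, it can help to work one case by hand. The derivation below is an illustrative sketch rather than the paper's own argument: it assumes the KL generator $f(u) = u\log u$ and an unnormalized Gaussian kernel $G_{\sigma}(d^{2}) = \exp(-d^{2}/(2\sigma^{2}))$, and substitutes the $f$-Gaussian similarity into the $f$-MICL objective from the Proposed Method section.

```latex
\begin{align*}
f(u) &= u\log u, \qquad f'(u) = \log u + 1, \qquad f^{*}(t) = e^{t-1}, \\
s_f(x^{g},y^{g}) &= f'\!\big(G_{\sigma}(\|x^{g}-y^{g}\|^{2})\big)
   = 1 - \frac{\|x^{g}-y^{g}\|^{2}}{2\sigma^{2}}, \\
f^{*}\!\circ s_f(x^{g},y^{g}) &= \exp\!\Big(-\frac{\|x^{g}-y^{g}\|^{2}}{2\sigma^{2}}\Big)
   = G_{\sigma}(\|x^{g}-y^{g}\|^{2}), \\
\mathbb{E}_{p_{+}}[s_f] - \mathbb{E}_{p_{\times}}[f^{*}\!\circ s_f]
  &= 1 - \frac{1}{2\sigma^{2}}\,\mathbb{E}_{p_{+}}\|x^{g}-y^{g}\|^{2}
     - \mathbb{E}_{p_{\times}}\exp\!\Big(-\frac{\|x^{g}-y^{g}\|^{2}}{2\sigma^{2}}\Big).
\end{align*}
```

Under these assumptions, maximizing the objective shrinks the expected squared distance between positive pairs (an alignment term) while pushing apart pairs drawn from the product distribution $p_{\times}$ (a uniformity-like term).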
Figure 2: Network architecture of $f$-MICL. $\mathtt{image}_{i}$: the $i^{\rm th}$ image in the current batch; $f$: the function used in the $f$-mutual information (§ 2); $g$: feature embedding; $t$, $t_{1}$, $t_{2}$: augmentation functions drawn from the same family $\mathcal{T}$ of augme


This review was generated by AI and checked by human editors.