QUICK REVIEW

[Paper Review] Unicom: Universal and Compact Representation Learning for Image Retrieval

Xiang An, Jiankang Deng | arXiv (Cornell University) | Apr 12, 2023

Advanced Image and Video Retrieval Techniques | Cited by 17

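One-sentence summary

Unicom learns universal and compact image representations by clustering CLIP image-text features from LAION-400M and applying conflict-robust random negative prototype selection together with random feature selection, improving both unsupervised and supervised image retrieval.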

ABSTRACT

Modern image retrieval methods typically rely on fine-tuning pre-trained encoders to extract image-level descriptors. However, the most widely used models are pre-trained on ImageNet-1K with limited classes. The pre-trained feature representation is therefore not universal enough to generalize well to the diverse open-world classes. In this paper, we first cluster the large-scale LAION400M into one million pseudo classes based on the joint textual and visual features extracted by the CLIP model. Due to the confusion of label granularity, the automatically clustered dataset inevitably contains heavy inter-class conflict. To alleviate such conflict, we randomly select partial inter-class prototypes to construct the margin-based softmax loss. To further enhance the low-dimensional feature representation, we randomly select partial feature dimensions when calculating the similarities between embeddings and class-wise prototypes. The dual random partial selections are with respect to the class dimension and the feature dimension of the prototype matrix, making the classification conflict-robust and the feature embedding compact. Our method significantly outperforms state-of-the-art unsupervised and supervised image retrieval approaches on multiple benchmarks. The code and pre-trained models are released to facilitate future research https://github.com/deepglint/unicom.

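Motivation and objectives

Address the limited generalization of ImageNet-pretrained features in open-world retrieval.
Form pseudo classes from a large unlabeled corpus using multimodal (image + text) clustering.
Design a robust discrimination objective with random negative prototype selection to handle inter-class conflict.
Promote feature compactness through random feature selection, improving retrieval efficiency.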

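Proposed method

Run offline k-means on the joint image and text features that CLIP extracts from LAION-400M, grouping the data into one million pseudo classes.
During training, randomly select a subset of the negative prototypes (the class dimension) at each iteration to obtain a conflict-robust margin-based softmax.
When computing the loss, randomly select a subspace of feature dimensions in both the embeddings and the prototypes (a shared Gamma_t mask) to enforce feature compactness.
Maintain the full prototype matrix, but update only a random subset of classes and feature dimensions at each iteration, which reduces inter-class conflict and promotes compactness.
Use an ArcFace-style margin-based softmax (margin=0.3, scale=64) for both pre-training and retrieval tasks.
Optionally fuse image and text features (average fusion) during clustering to form the prototypes.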
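
The clustering step described above can be illustrated with a small sketch. It is a toy version under stated assumptions, not the authors' pipeline: the function names (`fuse_features`, `kmeans_pseudo_labels`) are hypothetical, random arrays stand in for CLIP features, and the use of cosine similarity (spherical k-means) is an assumption, since the summary only says "offline k-means" with average fusion.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def fuse_features(image_feats, text_feats):
    """Average fusion: mean of normalized image and text features, renormalized."""
    fused = 0.5 * (l2_normalize(image_feats) + l2_normalize(text_feats))
    return l2_normalize(fused)

def kmeans_pseudo_labels(feats, k, n_iter=20, seed=0):
    """Lloyd-style k-means with cosine similarity (an assumption of this sketch).

    Returns one pseudo-class label per sample and the k unit-norm centroids.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct samples.
    centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each sample to its most similar centroid.
        labels = np.argmax(feats @ centroids.T, axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[c] = members.mean(axis=0)
        centroids = l2_normalize(centroids)
    labels = np.argmax(feats @ centroids.T, axis=1)
    return labels, centroids

# Toy stand-ins for CLIP image and text features (200 samples, 16 dimensions).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(200, 16))
text_feats = rng.normal(size=(200, 16))

fused = fuse_features(image_feats, text_feats)
pseudo_labels, centroids = kmeans_pseudo_labels(fused, k=8)
```

At the scale the summary reports (LAION-400M, one million clusters), a dense loop like this would not be practical; the sketch only shows how fused features turn into pseudo-class labels.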

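Experimental results

Research questions

RQ1: Does cluster discrimination with random negative prototype selection outperform instance discrimination in CLIP-based systems, thereby improving universal representation learning?
RQ2: Can random feature selection during discrimination yield compact yet competitive retrieval embeddings without sacrificing accuracy?
RQ3: How do the number of clusters (k) and the clustering modality (image, text, or joint) affect retrieval performance?
RQ4: Does the proposed method generalize to unsupervised and supervised image retrieval, as well as to transfer learning (e.g., ImageNet-1K)?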

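Key findings

Linear probing on 13 datasets shows that the proposed cluster discrimination outperforms CLIP and OPEN-CLIP trained on the same data, with average gains of 3.6% (ViT B/32), 2.7% (ViT B/16), and 1.4% (ViT L/14).
Unsupervised image retrieval with ViT L/14 reaches an average mAP of 69.9% on 7 datasets, surpassing OPEN-CLIP by 7.5% and a larger OPEN-CLIP model by 5.4%.
Transfer learning to ImageNet-1K yields competitive Top-1 accuracy; for example, with LAION-400M pre-training, ViT B/16 (Ours) reaches 85.9% and ViT L/14 (Ours) reaches 88.3% Top-1.
Compared with existing methods, Unicom with joint image + text clustering and the random selection strategies delivers consistent gains on linear-probe and retrieval benchmarks across diverse datasets (CUB, Cars, SOP, In-Shop, iNaturalist, VehicleID, GLDv2).
Ablation studies indicate that random negative class sampling (r1 ≈ 0.1) and random feature sampling (r2 ≈ 0.5) are important for achieving strong performance and feature compactness.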
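
To make the two sampling ratios concrete, here is a minimal forward-pass sketch of a margin-based softmax with random negative prototype selection (ratio r1) and a shared random feature mask (ratio r2). It uses the margin and scale values quoted in this review (0.3 and 64). Several things are assumptions of the sketch rather than facts from the paper: the function name `partial_margin_softmax_loss`, the choice to always keep the classes present in the batch and sample only the remaining negatives, and the omission of the usual safeguard for angles where theta + margin exceeds pi. It computes the loss value only; real training would use an autodiff framework.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def partial_margin_softmax_loss(emb, labels, prototypes, r1=0.1, r2=0.5,
                                margin=0.3, scale=64.0, rng=None):
    """ArcFace-style loss over a random subset of classes and feature dims."""
    rng = np.random.default_rng() if rng is None else rng
    num_classes, dim = prototypes.shape

    # Random feature selection: one mask shared by embeddings and prototypes.
    n_dims = max(1, int(round(r2 * dim)))
    dims = rng.choice(dim, size=n_dims, replace=False)

    # Random negative prototype selection (assumed scheme): keep every class
    # that appears in the batch, then sample negatives up to r1 * num_classes.
    positives = np.unique(labels)  # sorted unique class ids
    n_keep = max(len(positives), int(round(r1 * num_classes)))
    candidates = np.setdiff1d(np.arange(num_classes), positives)
    n_neg = min(n_keep - len(positives), len(candidates))
    negatives = rng.choice(candidates, size=n_neg, replace=False)
    classes = np.concatenate([positives, negatives])

    # Cosine similarity inside the selected sub-block of the prototype matrix.
    e = l2_normalize(emb[:, dims])
    w = l2_normalize(prototypes[np.ix_(classes, dims)])
    cos = np.clip(e @ w.T, -1.0, 1.0)  # shape: (batch, len(classes))

    # Column of each sample's own class; valid because positives come first
    # in `classes` and are sorted.
    target = np.searchsorted(positives, labels)
    rows = np.arange(len(labels))

    # Additive angular margin on the target logit only.
    theta = np.arccos(cos[rows, target])
    logits = scale * cos
    logits[rows, target] = scale * np.cos(theta + margin)

    # Numerically stable cross-entropy.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, target].mean(), classes, dims

# Toy example: 4 embeddings, 50 pseudo classes, 8 feature dimensions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
prototypes = rng.normal(size=(50, 8))
labels = np.array([3, 7, 7, 42])
loss, classes, dims = partial_margin_softmax_loss(emb, labels, prototypes, rng=rng)
```

With r1 = 0.1 and 50 classes, each call scores the batch against 5 prototypes (the 3 classes in the batch plus 2 sampled negatives), and with r2 = 0.5 it uses 4 of the 8 feature dimensions. Drawing fresh subsets on every iteration is what this review describes as reducing inter-class conflict and encouraging compact features.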

This review was generated by AI and reviewed by human editors.