QUICK REVIEW

[Paper Digest] Recognize Anything: A Strong Image Tagging Model

Youcai Zhang, Xinyu Huang|arXiv (Cornell University)|Jun 6, 2023

Multimodal Machine Learning Applications|Cited by 21

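One-Sentence Summary

RAM is a foundation model for image tagging that trains on annotation-free image-text data and uses semantic tag queries to achieve strong zero-shot recognition across 6,400+ tags and open-set categories, outperforming CLIP, BLIP, and some fully supervised baselines.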

ABSTRACT

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the caption and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses the fully supervised manners and exhibits competitive performance with the Google tagging API. We are releasing the RAM at https://recognize-anything.github.io/ to foster the advancements of large models in computer vision.

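Motivation and Objectives

Establish a universal, unified label system that covers the common labels found in classification, detection, and segmentation datasets as well as in commercial tagging products.
Develop a data-efficient, open-vocabulary tagging model capable of zero-shot recognition of unseen categories.
Create a data engine that automatically generates and cleans annotations from large-scale image-text data to improve tag quality.
Demonstrate RAM's zero-shot tagging performance on classification, detection, and segmentation benchmarks, and compare it with state-of-the-art models.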

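Proposed Method

Parse captions and descriptions into large-scale, annotation-free image tags through automatic text semantic parsing.
Train a joint captioning and tagging model that exploits image-tag-text triplets.
Introduce an off-the-shelf text encoder that converts tags into semantically rich textual tag queries, enabling open-vocabulary recognition.
Use a visual backbone (Swin Transformer) paired with a lightweight image-tag recognition decoder and a text generation encoder-decoder for captioning.
Distill image features from CLIP to improve recognition of unseen categories and enable open-set capability.
Build a data engine that generates additional tags, localizes the region for each tag with Grounding-DINO, clusters the regions, and filters outliers to clean the tags.
Fine-tune on a smaller, higher-quality dataset (COCO) to improve performance.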
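
The first step above, turning captions into tags without manual annotation, can be sketched as follows. This is a minimal illustration and not the parser used by RAM: it replaces semantic parsing with plain keyword matching, and the vocabulary `TAG_VOCAB`, the functions `parse_tags` and `build_triplets`, and the sample caption are all hypothetical.

```python
import re

# Hypothetical mini vocabulary: canonical tag -> surface forms (synonyms, plurals).
# The real label system covers 6,400+ tags; this one is only for illustration.
TAG_VOCAB = {
    "dog": ["dog", "dogs", "puppy", "puppies"],
    "frisbee": ["frisbee", "frisbees"],
    "grass": ["grass", "lawn"],
    "person": ["person", "people", "man", "woman"],
}

# Invert the vocabulary so each surface form points at its canonical tag.
SURFACE_TO_TAG = {form: tag for tag, forms in TAG_VOCAB.items() for form in forms}


def parse_tags(caption):
    """Return the canonical tags mentioned in a caption, in first-seen order."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    tags = []
    for token in tokens:
        tag = SURFACE_TO_TAG.get(token)
        if tag is not None and tag not in tags:
            tags.append(tag)
    return tags


def build_triplets(pairs):
    """Turn (image_id, caption) pairs into (image_id, tags, caption) triplets."""
    return [(image_id, parse_tags(caption), caption) for image_id, caption in pairs]


if __name__ == "__main__":
    pairs = [("img_001", "A man throws a frisbee to his puppy on the lawn.")]
    for triplet in build_triplets(pairs):
        print(triplet)
```

Mapping synonyms to one canonical tag is what lets a single caption supervise both tasks: the parsed tags supervise tagging, while the original text supervises captioning.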

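Experimental Results

Research Questions

RQ1: Across a broad open set of categories (6,400+), can RAM achieve strong zero-shot image tagging using only annotation-free training data?
RQ2: How does combining captioning with tagging, together with semantically rich textual tag queries, affect open-set recognition and overall tagging accuracy?
RQ3: How much does the data engine (generation, cleaning, fine-tuning) improve tag quality and downstream zero-shot performance?
RQ4: How does RAM compare with state-of-the-art multi-label classification, detection, segmentation, and vision-language models in zero-shot and supervised settings?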

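Key Findings

RAM achieves strong zero-shot tagging performance on multiple benchmarks and significantly outperforms CLIP and BLIP.
RAM surpasses some fully supervised methods and is competitive with the Google tagging API in several open-set scenarios.
RAM pretrained on 4M images already surpasses ML-Decoder on OpenImages-common, and RAM-14M improves further across all tests.
Expanding the label system and adding semantic textual tag queries significantly improves open-set recognition and tagging coverage.
The data engine (generation, cleaning, scaling to 14M images, and fine-tuning on COCO) yields significant gains on OPPO-common, OpenImages-common, and OpenImages-rare.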
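
The cleaning stage of the data engine, summarized in this digest as clustering localized regions and filtering outliers, can be illustrated with a deliberately simplified sketch. It assumes each region for a given tag is already represented by a feature vector, treats all regions of the tag as a single cluster, and drops the fraction farthest from the centroid. The function names, the `drop_fraction` parameter, and the toy 2-D features are hypothetical, and the actual procedure in the paper may differ.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def filter_outliers(region_features, drop_fraction=0.1):
    """Return indices of regions kept after dropping the farthest ones.

    region_features: feature vectors for all regions carrying the same tag.
    drop_fraction:   hypothetical share of regions treated as outliers.
    """
    if not region_features:
        return []
    center = centroid(region_features)
    distances = [math.dist(v, center) for v in region_features]
    n_drop = int(len(region_features) * drop_fraction)
    # Rank regions from farthest to nearest and mark the first n_drop as outliers.
    ranked = sorted(range(len(region_features)), key=lambda i: distances[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(region_features)) if i not in dropped]


if __name__ == "__main__":
    # Four tightly grouped regions and one far-away region for the same tag.
    features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
    kept = filter_outliers(features, drop_fraction=0.25)
    print(kept)
```

Images whose region for a tag is flagged as an outlier would have that tag removed, which is one way noisy tags from parsed captions could be cleaned before retraining.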


This digest was generated by AI and reviewed by human editors.