QUICK REVIEW

[Paper Review] Tag2Text: Guiding Vision-Language Model via Image Tagging

Xinyu Huang, Youcai Zhang | arXiv (Cornell University) | Mar 10, 2023

Multimodal Machine Learning Applications | Cited by 11

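One-Sentence Summary

Tag2Text introduces image tagging, learned from annotation-free image-text pairs, to guide vision-language pre-training, and achieves strong results on zero-shot tagging as well as on generation-based and alignment-based tasks.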

ABSTRACT

This paper presents Tag2Text, a vision language pre-training (VLP) framework, which introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works which utilize object tags either manually labeled or automatically detected with an off-the-shelf detector with limited performance, our approach explicitly learns an image tagger using tags parsed from image-paired text and thus provides a strong semantic guidance to vision-language models. In this way, Tag2Text can utilize large-scale annotation-free image tags in accordance with image-text pairs, and provides more diverse tag categories beyond objects. As a result, Tag2Text demonstrates the ability of a foundational image tagging model, with superior zero-shot performance even comparable to fully supervised models. Moreover, by leveraging the tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art results with similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance. Code, demo and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.

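Research Motivation and Objectives

Improve vision-language pre-training by injecting rich, annotation-free image tags that go beyond objects.
Make tagging guidance scalable by deriving tags from the paired text rather than from manual labels or off-the-shelf detectors.
Show that tagging guidance improves both generation-based and alignment-based vision-language tasks in a detector-free architecture.
Show that a large and diverse tag set (3,429 categories) improves zero-shot tagging and downstream vision-language benchmarks.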

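Proposed Method

Mine image tags from image-text pairs with a text semantic parser, yielding 3,429 commonly used tag categories.
Add an image tagging head that learns to predict the parsed tags without manual annotation.
Propose image-tag-text generation as a pre-training task: captions are generated conditioned on both the image features and the assigned tags.
Add an image-text alignment component with a coarse-grained image-text contrastive (ITC) loss and a fine-grained image-text matching (ITM) loss, using tag-guided hard negative mining.
Train with a multi-task objective covering tagging, generation (image-tag-text generation), and alignment (ITC/ITM).
Support tag-guided inference, in which user-provided tags steer caption generation and retrieval.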
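
The first two steps (mining tags from captions and using them as tagging targets) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the six-word TAG_VOCAB stands in for the 3,429-category vocabulary, plain whole-word matching stands in for the text semantic parser, and ordinary binary cross-entropy stands in for whatever tagging loss the authors actually use. It is not the Tag2Text implementation.

```python
import math
import re

# Hypothetical toy vocabulary; the paper reports 3,429 tag categories.
TAG_VOCAB = ["dog", "grass", "beach", "run", "red", "ball"]


def parse_tags(caption, vocab=TAG_VOCAB):
    """Return the vocabulary tags that occur as whole words in the caption.

    Assumption: simple word matching replaces a real semantic parser,
    so inflected forms such as "running" are not mapped to "run".
    """
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return [tag for tag in vocab if tag in words]


def multi_hot(tags, vocab=TAG_VOCAB):
    """Encode a tag list as a multi-hot target vector over the vocabulary."""
    present = set(tags)
    return [1.0 if tag in present else 0.0 for tag in vocab]


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between per-tag probabilities and targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


caption = "A dog chasing a red ball on the grass"
tags = parse_tags(caption)    # supervision comes from the caption alone
targets = multi_hot(tags)     # what a tagging head would be trained to predict
```

Because the targets are derived from the paired caption, no manual tag annotation is needed; a tagging head would output one probability per vocabulary entry and be scored against targets with a multi-label loss such as the one sketched here.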

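Experimental Results

Research Questions

RQ1: Can annotation-free image tags parsed from text provide strong semantic guidance for vision-language pre-training?
RQ2: Compared with detector-based baselines, does detector-free vision-language pre-training with image tagging improve both generation-based and alignment-based tasks?
RQ3: How does tagging guidance affect zero-shot tagging, caption quality, and cross-modal retrieval?
RQ4: How many tags, and which types (objects, scenes, attributes, actions), benefit vision-language learning?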

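Key Findings

Tag2Text achieves stronger zero-shot image tagging on OpenImages and COCO than state-of-the-art vision-language models.
Image tagging guidance improves detector-free vision-language models on generation-based tasks (image captioning) and alignment-based tasks (image-text retrieval).
Pre-training on 4M and 14M image-text pairs yields strong tagging, captioning, and retrieval results across benchmarks, with Tag2Text-Swin performing particularly well.
The tagging head and the large, diverse tag set bridge images and text better than detector-based approaches while keeping end-to-end training efficient.
A two-stage pre-train and fine-tune paradigm (pre-training on large-scale text-derived tags, then fine-tuning on downstream tasks) improves multi-label recognition and downstream vision-language performance.
Tagging guidance makes caption generation controllable: users can specify tags to steer the generated description.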

This review was generated by AI and reviewed by human editors.