QUICK REVIEW

[Paper Review] Visual Classification via Description from Large Language Models

Sachit Menon, Carl Vondrick | arXiv (Cornell University) | Oct 13, 2022

Multimodal Machine Learning Applications | Cited by 57

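ONE-SENTENCE SUMMARY

This paper replaces category-name embeddings with language descriptors generated by GPT-3 and grounds them with CLIP, targeting zero-shot visual classification, interpretability, and adaptability.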

ABSTRACT

Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages past interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.

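MOTIVATION AND OBJECTIVES

Motivate replacing the bare category names of visual classes with descriptive language descriptors.
Propose a scalable way to generate descriptors with a large language model.
Ground the descriptors with a vision-language model so that class scores are computed transparently.
Demonstrate improvements in accuracy, adaptability to new concepts, and the ability to correct bias.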

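PROPOSED METHOD

Represent each category c by a set of descriptors D(c), each expressed as a natural-language sentence.
Compute the class score s(c, x) as the mean descriptor relevance: s(c, x) = (1/|D(c)|) * sum_{d in D(c)} phi(d, x), where phi(d, x) is the log probability that descriptor d pertains to image x.
Build D(c) automatically by asking a large language model (e.g., GPT-3) a question such as: "What are useful features for distinguishing a {category} in a photo?"
Ground the descriptors with a vision-language model (CLIP) by measuring the similarity between the image and each text descriptor, conditioned on the category name.
Provide interpretability by letting the user inspect which descriptors were activated for a given image and why the category was chosen.
Classify by selecting the category with the highest s(c, x).
Describe how descriptors can be edited to mitigate bias and to adapt to new concepts.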
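The scoring and selection steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are taken as already computed (for example by CLIP), cosine similarity stands in for phi(d, x), and the toy vectors and the helper names `build_prompt`, `cosine`, `class_score`, and `classify` are hypothetical.

```python
import math

# Question used to elicit descriptors from a language model (wording from the method list).
PROMPT_TEMPLATE = "What are useful features for distinguishing a {category} in a photo?"


def build_prompt(category):
    """Fill the descriptor-eliciting question for one category."""
    return PROMPT_TEMPLATE.format(category=category)


def cosine(u, v):
    """Cosine similarity, used here as an assumed stand-in for phi(d, x)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_score(image_emb, descriptor_embs):
    """s(c, x): mean of phi(d, x) over all descriptors d in D(c)."""
    return sum(cosine(image_emb, d) for d in descriptor_embs) / len(descriptor_embs)


def classify(image_emb, descriptors_by_class):
    """Score every class and return (best class, all scores)."""
    scores = {c: class_score(image_emb, embs) for c, embs in descriptors_by_class.items()}
    return max(scores, key=scores.get), scores


# Hand-made 3-d vectors standing in for real image and text embeddings.
descriptors_by_class = {
    "tiger": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
    "zebra": [[0.0, 1.0, 0.0], [0.0, 0.6, 0.8]],
}
image_emb = [0.9, 0.1, 0.0]

predicted, scores = classify(image_emb, descriptors_by_class)
print(build_prompt("tiger"))
print(predicted, scores)
```

Because the score is a mean rather than a sum, classes with different numbers of descriptors remain comparable.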

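EXPERIMENTAL RESULTS

Research Questions

RQ1: Can descriptor-based classification with LLM-generated attributes outperform standard CLIP-style category-name embeddings in accuracy?
RQ2: Do descriptor-based models provide inherent interpretability by exposing the features that drive a decision?
RQ3: Can GPT-3-derived descriptors enable recognition of concepts that were unseen during training or that emerged after deployment?
RQ4: How does descriptor editing affect bias and improve fairness across demographic or cultural subgroups?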

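Key Findings

Consistent accuracy gains over CLIP across multiple datasets: roughly 3–5% on ImageNet and up to about 7% in some non-natural-image domains.
Using GPT-3 descriptors, the method recognizes concepts that emerged after training (e.g., Wordle, Ever Given), reaching 100% recall within the top 10 on these examples, which CLIP alone fails to do.
Descriptors make predictions interpretable by showing which features (descriptors) contributed to the decision.
Descriptor editing mitigates bias (e.g., cultural bias in recognizing weddings) and improves accuracy on underrepresented groups.
The method achieves interpretability without additional training and scales through LLM-generated descriptors.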
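The interpretability and editing findings can be illustrated with a small sketch. It rests on assumptions rather than the paper's code: cosine similarity over hand-made 2-d vectors stands in for CLIP-based relevance, and the descriptor strings and the helpers `cosine` and `explain` are hypothetical. It shows how per-descriptor scores expose what drove a prediction, and how adding a descriptor changes the class score with no retraining.

```python
import math


def cosine(u, v):
    """Cosine similarity, an assumed stand-in for descriptor relevance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def explain(image_emb, descriptors):
    """Return (descriptor, relevance) pairs sorted high to low, plus their mean."""
    ranked = sorted(
        ((text, cosine(image_emb, emb)) for text, emb in descriptors.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    score = sum(phi for _, phi in ranked) / len(ranked)
    return ranked, score


# Hypothetical descriptor set for one class, with toy embeddings.
wedding = {
    "white dress": [1.0, 0.0],
    "tuxedo": [0.9, 0.1],
}
image_emb = [0.1, 1.0]  # toy image that the original descriptors barely match

ranked_before, score_before = explain(image_emb, wedding)

# Edit the descriptor set: add one descriptor, leave the model untouched.
edited = dict(wedding)
edited["red dress"] = [0.0, 1.0]
ranked_after, score_after = explain(image_emb, edited)

for text, phi in ranked_after:
    print(f"{text}: {phi:.3f}")
print(f"class score before edit: {score_before:.3f}, after edit: {score_after:.3f}")
```

Since the class score is an average, a new descriptor raises it only when its relevance to the image exceeds the current mean.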

This review was generated by AI and checked by a human editor.