QUICK REVIEW

[Paper Review] ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Chunyuan Li, Haotian Liu|arXiv (Cornell University)|Apr 19, 2022

Multimodal Machine Learning Applications | Cited by 64

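One-Sentence Summary

ELEVATER provides a public benchmark and an open-source toolkit for evaluating the task-level transfer of language-augmented visual models on 20 image classification datasets and 35 object detection datasets, with knowledge augmentation and automatic hyperparameter tuning.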

ABSTRACT

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). ELEVATER is a platform for Computer Vision in the Wild (CVinW), and is publicly released at https://computer-vision-in-the-wild.github.io/ELEVATER/

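Motivation and Goals

Evaluate how well language-augmented visual models transfer to diverse, real-world downstream datasets.
Introduce external knowledge sources to augment downstream tasks, and study their effect on zero-shot, few-shot, and full-shot transfer.
Provide an automated toolkit for fair, reproducible model adaptation and evaluation across the benchmark.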

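Proposed Method

Assemble a public benchmark (ICinW with 20 image classification (IC) datasets, ODinW with 35 object detection (OD) datasets), with every dataset augmented with external knowledge.
Develop an open-source toolkit with automatic hyperparameter tuning, removing manual tuning and ensuring fair comparison.
Propose language-augmented adaptation methods, including language-initialized two-projection and single-projection schemes for model adaptation.
Evaluate zero-shot, few-shot, and full-shot transfer to measure sample efficiency, and linear probing versus full-model fine-tuning to measure parameter efficiency.
Incorporate external knowledge sources (WordNet, Wiktionary, GPT-3) to assess their effect on zero-, few-, and full-shot transfer.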
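
The language-initialized adaptation listed above can be illustrated with a minimal sketch. It assumes that image embeddings and class-name text embeddings have already been extracted from a pre-trained model into a shared space. The names `zero_shot_predict` and `LinearHead` are hypothetical and are not part of the ELEVATER toolkit, and the sketch does not reproduce the paper's two-projection or single-projection variants. It only shows the general idea: a linear classification head whose weights start from the normalized text embeddings reproduces the zero-shot predictions before any training, whereas a randomly initialized head starts with no class knowledge.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_predict(image_feats, text_feats):
    """Pick the class whose text embedding is most cosine-similar to each image."""
    sims = l2_normalize(image_feats) @ l2_normalize(text_feats).T
    return sims.argmax(axis=1)


class LinearHead:
    """Hypothetical linear classifier on top of frozen image embeddings."""

    def __init__(self, weight, bias):
        self.weight = weight  # shape: (num_classes, dim)
        self.bias = bias      # shape: (num_classes,)

    @classmethod
    def language_init(cls, text_feats):
        # Assumption: one text embedding per class, e.g. from a class-name prompt.
        weight = l2_normalize(text_feats).copy()
        return cls(weight, np.zeros(weight.shape[0]))

    @classmethod
    def random_init(cls, num_classes, dim, rng):
        # Baseline: small random weights that carry no class information.
        weight = rng.normal(scale=0.01, size=(num_classes, dim))
        return cls(weight, np.zeros(num_classes))

    def predict(self, image_feats):
        logits = l2_normalize(image_feats) @ self.weight.T + self.bias
        return logits.argmax(axis=1)


# Synthetic stand-ins for real embeddings: 5 classes, 8 images, 16 dimensions.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 16))
image_feats = rng.normal(size=(8, 16))

lang_head = LinearHead.language_init(text_feats)
rand_head = LinearHead.random_init(num_classes=5, dim=16, rng=rng)

print("zero-shot:         ", zero_shot_predict(image_feats, text_feats))
print("language-init head:", lang_head.predict(image_feats))
print("random-init head:  ", rand_head.predict(image_feats))
```

In a few-shot or linear-probing run, either head would then be trained on the labeled samples; the language-initialized head simply begins from the zero-shot solution rather than from noise.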

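Experimental Results

Research Questions

RQ1: How does language augmentation affect task-level transfer for image classification and object detection across diverse datasets?
RQ2: How do external knowledge sources affect zero-shot, few-shot, and full-shot transfer performance?
RQ3: Which adaptation strategies (linear probing vs. fine-tuning) and initialization schemes make the best use of language and knowledge on downstream tasks?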

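Key Findings

In few-shot settings, language-augmented models consistently outperform language-free baselines.
Language-initialized adaptation (two-projection or single-projection) significantly outperforms random initialization on both IC and OD.
Few-shot results are generally better than zero-shot results, contrary to earlier findings that zero-shot dominates.
With very little data, linear probing often outperforms full fine-tuning; as task data grows, fine-tuning can overtake linear probing.
External knowledge (WordNet, Wiktionary, GPT-3) improves zero-, few-, and full-shot transfer on several datasets; when used appropriately, GPT-3 offers broader coverage.
Prompt-based or knowledge-integrated adaptation (such as GLIP-style prompting) achieves competitive or better results with fewer trainable parameters.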

This review was generated by AI and reviewed by a human editor.