QUICK REVIEW

[Paper Review] Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhiqiu Lin, Samuel Yu|arXiv (Cornell University)|Jan 16, 2023
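Multimodal Machine Learning Applications|Cited by 8
One-Sentence Summary

The paper shows that cross-modal adaptation of multimodal foundation models (such as CLIP and AudioCLIP) improves unimodal few-shot classification by treating examples from other modalities as additional training samples, reaching state-of-the-art results with a simple linear probe and extending to audiovisual settings.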

ABSTRACT

The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP learn cross-modal encoders that map different modalities to the same representation space. Specifically, we propose a simple strategy for cross-modal adaptation: we treat examples from different modalities as additional few-shot examples. For example, by simply repurposing class names as an additional training sample, we trivially turn any n-shot learning problem into a (n+1)-shot problem. This allows us to produce SOTA results with embarrassingly simple linear classifiers. We show that our approach can be combined with existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.

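Motivation and Goals

  • Exploit cross-modal information so that multimodal signals can resolve the ambiguity of few-shot learning.
  • Propose a lightweight cross-modal adaptation framework that uses other modalities as additional training samples for few-shot tasks.
  • Demonstrate that cross-modal adaptation outperforms state-of-the-art unimodal adaptation methods across multiple datasets.
  • Show that the method extends beyond vision-language to audio and audiovisual settings.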

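Proposed Method

  • Formalize cross-modal learning as modality-specific encoders that map into a shared embedding space.
  • Train a single linear classifier on both visual features and features from the auxiliary modality.
  • Treat each class label (text) as an additional one-shot sample, turning an n-shot problem into an (n+1)-shot problem.
  • Support inference on test samples from any modality by using the learned cross-modal weights.
  • Use the Representer Theorem to interpret the learned classifier as an ensemble over modalities.
  • Run vision, language, and audio adaptation experiments with CLIP and AudioCLIP on 11 datasets under a fixed few-shot evaluation protocol.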
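
The cross-modal linear probe described in the list above can be sketched in a few lines. This is a minimal illustration, assuming features have already been extracted by frozen encoders into a shared space; the toy vectors, the function names, the learning rate, and the plain full-batch gradient descent are all assumptions made here for illustration, not the paper's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_cross_modal_set(image_feats, image_labels, text_feats):
    """Append one text feature per class as an extra labeled sample.

    text_feats[c] is assumed to be the embedding of class c's name.
    """
    feats = [l2_normalize(f) for f in image_feats]
    labels = list(image_labels)
    for class_idx, text_feat in enumerate(text_feats):
        feats.append(l2_normalize(text_feat))
        labels.append(class_idx)
    return feats, labels


def train_linear_probe(feats, labels, num_classes, lr=0.5, epochs=200):
    """Full-batch gradient descent on softmax cross-entropy, no bias term."""
    dim = len(feats[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        grads = [[0.0] * dim for _ in range(num_classes)]
        for x, y in zip(feats, labels):
            probs = softmax([dot(w, x) for w in weights])
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                for k in range(dim):
                    grads[c][k] += err * x[k] / len(feats)
        for c in range(num_classes):
            for k in range(dim):
                weights[c][k] -= lr * grads[c][k]
    return weights


def predict(weights, feat):
    """Classify a feature from any modality; all share one embedding space."""
    x = l2_normalize(feat)
    logits = [dot(w, x) for w in weights]
    return max(range(len(logits)), key=lambda c: logits[c])


# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP features).
image_feats = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0]]  # 1 shot per class
image_labels = [0, 1]
text_feats = [[0.9, 0.0, 0.3], [0.0, 0.9, 0.3]]   # one class-name embedding per class

feats, labels = build_cross_modal_set(image_feats, image_labels, text_feats)
weights = train_linear_probe(feats, labels, num_classes=2)
print(len(feats))                         # 1-shot images plus one text sample per class
print(predict(weights, [0.8, 0.2, 0.1]))  # classify an unseen "image" feature
```

Because the text samples enter the same training set as the image samples, the only change relative to a unimodal linear probe is the call to `build_cross_modal_set`; the classifier and the loss stay the same.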
Figure 2: Adding additional modalities helps few-shot learning. Adding textual labels to a 2-shot cat-vs-dog classification task leads to better test performance (by turning the problem into a 3-shot cross-modal task!). We visualize cross-modal CLIP [21] features (projection to 2D with principal
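
The ensemble reading of the classifier via the Representer Theorem can be written out as follows. This is a sketch under the usual assumption of a regularized linear classifier trained on image features $x_i$ and text features $t_j$ in the shared space; the coefficient symbols $\alpha_{iy}$ and $\beta_{jy}$ and the test feature $\hat{x}$ are notation chosen here for illustration and may differ from the paper's.

```latex
w_y = \sum_{i} \alpha_{iy}\, x_i + \sum_{j} \beta_{jy}\, t_j
\quad\Longrightarrow\quad
w_y^\top \hat{x}
  = \underbrace{\sum_{i} \alpha_{iy}\, x_i^\top \hat{x}}_{\text{image-based score}}
  + \underbrace{\sum_{j} \beta_{jy}\, t_j^\top \hat{x}}_{\text{text-based score}}
```

Under this view, the score for class $y$ is a sum of an image-based and a text-based term, which is why the jointly trained classifier behaves like an ensemble over modalities.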

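Experimental Results

Research Questions

  • RQ1: Can additional modalities (text, audio) serve as extra training samples to improve few-shot visual classification?
  • RQ2: Does cross-modal adaptation improve over unimodal fine-tuning or probing methods across different datasets?
  • RQ3: Is cross-modal training orthogonal and complementary to existing adaptation techniques (prompting, adapters)?
  • RQ4: Does the method extend to an audiovisual benchmark and improve both image and audio classification?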

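Key Findings

  • Under the CoOp protocol, cross-modal adaptation with a simple linear probe achieves state-of-the-art results on 11 datasets.
  • Using text labels as training samples often turns a 1-shot task into a more effective 2-shot or 3-shot one, and sometimes even outperforms unimodal methods trained with more shots.
  • Cross-modal adaptation consistently improves on unimodal baselines and on other adaptation methods (prompting, adapters, robust fine-tuning), especially in low-data regimes.
  • Partially fine-tuning the modality-specific encoders improves performance further, setting a new SOTA in some settings.
  • Extending to audio with AudioCLIP and constructing an image-audio benchmark shows that, in most cases, adding a one-shot sample from the other modality improves both image and audio classification.
  • Text-based augmentation (using class names as prompts) remains beneficial and can be combined with image augmentation to improve robustness.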
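
Text-based augmentation can be pictured as expanding each class name into several prompts, each of which would then be encoded by the text encoder and added as one more labeled training sample. The sketch below covers only the prompt-expansion step; the templates and the `expand_class_names` helper are hypothetical and are not taken from the paper.

```python
# Hypothetical prompt templates; the templates used in the paper are not given here.
TEMPLATES = ["{}", "a photo of a {}.", "a close-up photo of a {}."]


def expand_class_names(class_names, templates=TEMPLATES):
    """Return (prompt, class_index) pairs, one per template per class."""
    samples = []
    for class_idx, name in enumerate(class_names):
        readable = name.replace("_", " ")  # dataset labels often use underscores
        for template in templates:
            samples.append((template.format(readable), class_idx))
    return samples


samples = expand_class_names(["cat", "hot_dog"])
for prompt, class_idx in samples:
    print(class_idx, prompt)
```

Each extra template adds one text sample per class, so the effective shot count grows without collecting any new images.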
Figure 3: Cross-modality reduces the ambiguity of few-shot learning. Classic (uni-modal) few-shot learning is often underspecified. Even for binary classification, when given only a single image per class (left), it is unclear whether the target class is the animal, the hat, or the background sc


This review was generated by AI and reviewed by human editors.