QUICK REVIEW

[Paper Review] Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue|arXiv (Cornell University)|Apr 29, 2022

Multimodal Machine Learning Applications|Cited by 1,238

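One-Sentence Summary

Flamingo is a visual language model that conditions a frozen large language model on interleaved visual inputs through a Perceiver-based visual resampler and tanh-gated cross-attention, achieving strong few-shot performance across diverse image, video, and language tasks and enabling open-ended generation without task-specific fine-tuning.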

ABSTRACT

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

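Motivation and Goals

Enable rapid adaptation to novel multimodal tasks from as few annotated examples as possible.
Bridge pretrained vision-only and language-only models to handle interleaved visual and textual data.
Condition open-ended language generation on images and videos rather than producing a fixed set of outputs.
Evaluate few-shot performance on diverse vision-language benchmarks and analyze the design choices.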

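Proposed Method

Use a frozen large language model (Chinchilla) as the backbone and insert trainable cross-attention blocks that condition it on visual inputs.
Represent images and videos with a Perceiver Resampler, which produces a fixed number of visual tokens from feature maps of variable size.
Interleave text and visual tokens in the prompt so that the model predicts the next text token conditioned on the preceding text and the preceding visual inputs.
Train on a mixture of web-scraped vision-language data (interleaved text and images from HTML pages, image-text pairs, and video-text pairs) to support in-context learning.
Fuse visual information through tanh-gated cross-attention, which leaves the LM weights frozen and keeps training stable.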
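
The two architectural pieces listed above can be illustrated with a small NumPy sketch. This is a simplified toy under stated assumptions, not the authors' implementation: it uses a single attention head with no learned projections, and the names `attend`, `resample`, `gated_cross_attention`, and all sizes are hypothetical. It shows two properties only: the resampler returns the same number of visual tokens whatever the feature-map size, and a tanh gate at zero leaves the language-model states untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    # Single-head scaled dot-product attention, no learned projections
    # (an assumption made to keep the sketch short).
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    return softmax(scores) @ keys_values

def resample(features, latents):
    # Perceiver-style resampling: a fixed set of latent queries attends to
    # the visual features, so the output always has latents.shape[0] tokens.
    return latents + attend(latents, features)

def gated_cross_attention(text, visual, alpha):
    # Residual cross-attention scaled by tanh(alpha). With alpha == 0 the
    # gate is closed and the text states pass through unchanged.
    return text + np.tanh(alpha) * attend(text, visual)

rng = np.random.default_rng(0)
dim = 8
latents = rng.normal(size=(4, dim))     # 4 latent queries (hypothetical count)
small_map = rng.normal(size=(9, dim))   # flattened 3x3 feature map
large_map = rng.normal(size=(49, dim))  # flattened 7x7 feature map
text = rng.normal(size=(5, dim))        # 5 text-token states

tokens_small = resample(small_map, latents)
tokens_large = resample(large_map, latents)
out_closed = gated_cross_attention(text, tokens_small, alpha=0.0)
out_open = gated_cross_attention(text, tokens_small, alpha=1.0)

print(tokens_small.shape, tokens_large.shape)
print(np.allclose(out_closed, text), np.allclose(out_open, text))
```

In this toy the closed gate reproduces the text states exactly, which is the intuition behind gating as a way to add visual conditioning without disturbing a frozen LM at the start of training.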

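Experimental Results

Research Questions

RQ1: Can a visual language model perform diverse multimodal tasks in the few-shot setting without task-specific fine-tuning?
RQ2: Which architectural components are most effective for conditioning a frozen LM on interleaved visual inputs (images/videos) of variable length?
RQ3: How does training on a mixture of interleaved and paired vision-language data affect generalization and few-shot adaptation?
RQ4: To what extent do a few in-context examples in the prompt improve open-ended tasks such as captioning and visual question answering?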

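Key Findings

Flamingo sets a new state of the art for few-shot learning across a broad set of 16 multimodal tasks.
On six tasks, Flamingo matches or exceeds the fine-tuned SotA with only 32 task-specific examples.
Few-shot performance improves with both model size and number of shots, and larger models make better use of additional shots.
The architecture with gated cross-attention and the Perceiver Resampler conditions the frozen LM on interleaved visual inputs while keeping training stable.
Fine-tuning Flamingo on more data sets a new SotA on several tasks (VQAv2, VATEX, VizWiz, MSRVTTQA, HatefulMemes).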
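
To make the few-shot setting concrete, the sketch below assembles an interleaved prompt from a handful of support examples followed by a query image. It is an illustration only: the `<image>` and `<EOC>` markers, the `Output:` prefix, the function name `build_few_shot_prompt`, and the file names are assumptions made for this example and may not match the exact strings used by the model.

```python
IMAGE_TAG = "<image>"   # assumed placeholder marking where an image goes
END_OF_CHUNK = "<EOC>"  # assumed marker closing each image-text chunk

def build_few_shot_prompt(support, query_image, prefix="Output:"):
    """Interleave (image, text) support pairs, then append the query image.

    Returns the text prompt and the images in the order they appear.
    """
    parts = []
    images = []
    for image, text in support:
        parts.append(f"{IMAGE_TAG}{prefix} {text}{END_OF_CHUNK}")
        images.append(image)
    # The query chunk is left open so the model continues the text.
    parts.append(f"{IMAGE_TAG}{prefix}")
    images.append(query_image)
    return "".join(parts), images

support = [("cat.jpg", "A cat on a sofa."), ("dog.jpg", "A dog in a park.")]
prompt, images = build_few_shot_prompt(support, "bird.jpg")
print(prompt)
print(images)
```

A 32-shot prompt has the same shape with 32 support pairs; no model weights change, only the prompt does.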


This review was generated by AI and reviewed by a human editor.