QUICK REVIEW

[Paper Review] Multimodal Few-Shot Learning with Frozen Language Models

Maria Tsimpoukelli, Jacob Menick | arXiv (Cornell University) | Jun 25, 2021

Multimodal Machine Learning Applications | 35 references | 86 citations

TL;DR

Frozen transfers a pre-trained, frozen language model to multimodal tasks by training a vision encoder to produce a visual prefix that the language model attends to, enabling zero-shot and few-shot multimodal learning without updating the language model.

ABSTRACT

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.

Motivation & Objective

Extend the few-shot learning ability of large language models to multimodal (vision-language) tasks without fine-tuning the language model.
Enable rapid adaptation to new multimodal tasks through in-context prompting with interleaved image and text inputs.
Show that a frozen language model can draw on its encyclopedic knowledge for vision tasks and quickly bind words to visual concepts.
Demonstrate few-shot learning capabilities on diverse benchmarks including VQA, OKVQA, and miniImageNet in open-ended generation.

Proposed method

Use a pre-trained 7B autoregressive language model (Transformer) with frozen weights.
Train a vision encoder (NF-ResNet-50) to output a sequence of embeddings that form a visual prefix compatible with the language model.
Linearly map the vision encoder output to D × n channels and reshape the result into a sequence of n embeddings, each of dimension D (the language model's embedding size), to form the visual prefix.
Backpropagate gradients through the frozen language model to train only the vision encoder parameters.
Allow interleaving of image embeddings and text embeddings in the prompt, leveraging relative positional encodings for multiple images.
Evaluate in an open-ended, generative setting across zero-shot and few-shot scenarios, measuring token-based generation quality against ground-truth.
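The prefix construction and interleaving steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy sizes (8 image features, n = 2 tokens, D = 4) and the random weights are all assumptions chosen for readability, and the zero vectors merely stand in for the frozen language model's text embeddings.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a flat feature vector (lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def visual_prefix(features, weight, bias, n_tokens, d_model):
    """Map image features to n_tokens embeddings of size d_model."""
    flat = linear(features, weight, bias)  # length n_tokens * d_model
    assert len(flat) == n_tokens * d_model
    # Reshape the flat vector into a sequence of n_tokens embeddings.
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]

def interleave(segments):
    """Concatenate image prefixes and text embeddings into one sequence."""
    sequence = []
    for segment in segments:
        sequence.extend(segment)
    return sequence

# Toy sizes, chosen for illustration only (hypothetical values).
FEATURE_DIM, N_TOKENS, D_MODEL = 8, 2, 4
rng = random.Random(0)

# The only trainable parameters in this sketch: the linear map
# (in the paper, the vision encoder feeding it is trained as well).
weight = [[rng.uniform(-0.1, 0.1) for _ in range(FEATURE_DIM)]
          for _ in range(N_TOKENS * D_MODEL)]
bias = [0.0] * (N_TOKENS * D_MODEL)

image_features = [rng.random() for _ in range(FEATURE_DIM)]
prefix = visual_prefix(image_features, weight, bias, N_TOKENS, D_MODEL)

# Stand-in for the frozen language model's embeddings of three caption tokens.
caption_embeddings = [[0.0] * D_MODEL for _ in range(3)]

prompt = interleave([prefix, caption_embeddings])
print(len(prefix), len(prefix[0]), len(prompt))  # 2 4 5
```

Because the prefix lives in the same embedding space as the text tokens, several images can be placed in one prompt simply by passing more segments to `interleave`.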

Experimental results

Research questions

RQ1: Can a frozen large language model generate appropriate multimodal outputs when conditioned on a visual prefix produced by a trainable vision encoder?
RQ2: Does prompting with interleaved sequences of images and text enable zero-shot and few-shot learning on multimodal tasks (VQA, captioning, and category binding)?
RQ3: To what extent does the model leverage its encyclopedic knowledge for visual tasks (e.g., OKVQA) without task-specific fine-tuning?
RQ4: How does the model perform on fast concept binding tasks (miniImageNet open-ended and real-name variants) under few-shot conditioning?

Key findings

Trained only on image captioning, Frozen transfers to VQAv2 and outperforms both a blind baseline and a fine-tuned baseline, reaching 29.5/35.7/38.2% accuracy with 0/1/4 shots (Table 1).
Few-shot prompts improve VQA performance, approaching but not matching SGD training (e.g., 38.2% with four examples vs 48.4% with full VQA training, Table 1).
Performance on OKVQA scales with language model size, indicating encyclopedic knowledge contributes to multimodal reasoning without directly training on OKVQA.
Open-Ended miniImageNet results show substantial gains with more inner-shots and more varied exemplars, demonstrating fast binding of novel words to visual categories (Table 3).
Fast-VQA and Real-Fast-VQA indicate the model can incorporate recently learned words into multimodal questions, with performance improving as inner-shots increase (Table 5).
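To make the "inner-shots" terminology concrete, here is a hedged sketch of how a fast-binding prompt might be assembled. It assumes that inner-shots means the number of distinct exemplar images shown per category; the nonsense words "dax" and "blicket" follow the paper's examples, while the `Image` class, the file names, `build_prompt`, and the exact prompt wording are hypothetical placeholders. In the real system each `Image` would be replaced by its visual prefix and each string by the language model's text embeddings.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass(frozen=True)
class Image:
    """Hypothetical placeholder for an image that becomes a visual prefix."""
    path: str

Segment = Union[Image, str]

def build_prompt(support: List[Tuple[Image, str]],
                 query: Image,
                 task_induction: Optional[str] = None) -> List[Segment]:
    """Interleave support images with their labels, then append the query."""
    segments: List[Segment] = []
    if task_induction:
        segments.append(task_induction)
    for image, word in support:
        segments.append(image)
        segments.append(f"This is a {word}.")
    segments.append(query)
    # The model is expected to continue this text with the bound word.
    segments.append("Q: What is this? A: This is a")
    return segments

# 2-way task with 2 inner-shots: two distinct exemplars per novel word.
support = [
    (Image("dax_1.jpg"), "dax"),
    (Image("blicket_1.jpg"), "blicket"),
    (Image("dax_2.jpg"), "dax"),
    (Image("blicket_2.jpg"), "blicket"),
]
prompt = build_prompt(support, Image("query.jpg"),
                      task_induction="Answer with dax or blicket.")

inner_shots = Counter(word for _, word in support)
print(len(prompt), dict(inner_shots))  # 11 {'dax': 2, 'blicket': 2}
```

Raising the number of inner-shots in this sketch only lengthens the `support` list; no model weights change, which is what distinguishes in-context binding from fine-tuning.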

This review was created by AI and reviewed by human editors.