QUICK REVIEW

[Paper Review] Multimodal Few-Shot Learning with Frozen Language Models

Maria Tsimpoukelli, Jacob Menick | arXiv (Cornell University) | Jun 25, 2021
Multimodal Machine Learning Applications | 35 references | 86 citations
TL;DR

Frozen transfers a pre-trained, frozen language model to multimodal tasks by training a vision encoder to produce a visual prefix that the language model attends to, enabling zero-shot and few-shot multimodal learning without updating the language model.

ABSTRACT

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.

Motivation & Objective

  • Extend the few-shot learning ability of large language models to multimodal (vision-language) tasks without fine-tuning the language model.
  • Enable rapid adaptation to new multimodal tasks through in-context prompting with interleaved image and text inputs.
  • Show that a frozen language model can draw on its encyclopedic knowledge for vision tasks and quickly bind new words to visual concepts.
  • Demonstrate few-shot learning capabilities on diverse benchmarks including VQA, OKVQA, and miniImageNet in open-ended generation.

Proposed method

  • Use a pre-trained 7B autoregressive language model (Transformer) with frozen weights.
  • Train a vision encoder (NF-ResNet-50) to output a sequence of embeddings that form a visual prefix compatible with the language model.
  • Linearly map the vision encoder output to D·n channels and reshape the result into a sequence of n embeddings, each of dimension D (the language model's embedding size), to form the visual prefix.
  • Backpropagate gradients through the frozen language model to train only the vision encoder parameters.
  • Allow interleaving of image embeddings and text embeddings in the prompt, leveraging relative positional encodings for multiple images.
  • Evaluate in an open-ended, generative setting across zero-shot and few-shot scenarios, measuring token-based generation quality against ground-truth.
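
The visual-prefix construction described in the list above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`linear`, `visual_prefix`, `build_input`), the toy sizes, and the random weights are all assumptions made for illustration, and the real system uses an NF-ResNet-50 and a 7B-parameter language model in place of these stand-ins.

```python
import random

def linear(x, weight, bias):
    """Compute weight @ x + bias using plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def visual_prefix(features, weight, bias, n_tokens, d_model):
    """Map pooled image features to n_tokens embeddings of size d_model."""
    flat = linear(features, weight, bias)  # length n_tokens * d_model
    # Reshape the flat vector into n_tokens rows of d_model values each.
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]

def build_input(prefix, text_embeddings):
    """Prepend the visual prefix to the text token embeddings."""
    return prefix + text_embeddings

random.seed(0)
FEATURE_DIM, N_TOKENS, D_MODEL = 8, 2, 4  # toy sizes, not the paper's

# Stand-in for the vision encoder's pooled output (hypothetical values).
features = [random.gauss(0, 1) for _ in range(FEATURE_DIM)]

# The only trainable parameters in this sketch: the linear map.
weight = [[random.gauss(0, 0.1) for _ in range(FEATURE_DIM)]
          for _ in range(N_TOKENS * D_MODEL)]
bias = [0.0] * (N_TOKENS * D_MODEL)

# Stand-in for the frozen language model's embeddings of three text tokens.
text_embeddings = [[random.gauss(0, 1) for _ in range(D_MODEL)]
                   for _ in range(3)]

prefix = visual_prefix(features, weight, bias, N_TOKENS, D_MODEL)
sequence = build_input(prefix, text_embeddings)
print(len(prefix), len(prefix[0]), len(sequence))  # prints: 2 4 5
```

In this sketch the language model would consume `sequence` exactly as it consumes ordinary token embeddings. During training, gradients of the captioning loss would update only the vision-side parameters (here `weight` and `bias`, plus the encoder itself), while the language model's weights stay fixed.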

Experimental results

Research questions

  • RQ1: Can a frozen large language model generate appropriate multimodal outputs when conditioned on a visual prefix produced by a trainable vision encoder?
  • RQ2: Does prompting with interleaved sequences of images and text enable zero-shot and few-shot learning on multimodal tasks (VQA, captioning, and category binding)?
  • RQ3: To what extent does the model leverage its encyclopedic knowledge for visual tasks (e.g., OKVQA) without task-specific fine-tuning?
  • RQ4: How does the model perform on fast concept binding tasks (miniImageNet open-ended and real-name variants) under few-shot conditioning?
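
The interleaved prompting that RQ2 refers to can be sketched as follows. This is an illustrative layout only: the `Image` and `Text` classes, the `build_prompt` and `prompt_length` helpers, the example questions, and the whitespace "tokenizer" are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Image:
    name: str  # placeholder for pixel data

@dataclass(frozen=True)
class Text:
    content: str

Segment = Union[Image, Text]

def build_prompt(support: List[Tuple[Image, str]],
                 query_image: Image,
                 query_text: str) -> List[Segment]:
    """Interleave (image, text) support examples, then append the query."""
    prompt: List[Segment] = []
    for image, text in support:
        prompt.append(image)
        prompt.append(Text(text))
    prompt.append(query_image)
    prompt.append(Text(query_text))
    return prompt

def prompt_length(prompt: List[Segment], n_prefix_tokens: int) -> int:
    """Count sequence positions: each image occupies n_prefix_tokens slots.

    Text is split on whitespace as a stand-in for a real tokenizer.
    """
    total = 0
    for segment in prompt:
        if isinstance(segment, Image):
            total += n_prefix_tokens
        else:
            total += len(segment.content.split())
    return total

# Two-shot example with made-up questions and answers.
support = [
    (Image("img_a"), "Q: What animal is this? A: cat"),
    (Image("img_b"), "Q: What colour is the bus? A: red"),
]
query_image = Image("img_q")
query_text = "Q: What is on the table? A:"

few_shot = build_prompt(support, query_image, query_text)
zero_shot = build_prompt([], query_image, query_text)
print([type(segment).__name__ for segment in few_shot])
```

With an empty `support` list the same function yields the zero-shot prompt, so the number of shots is simply the number of (image, text) pairs placed before the query. Each `Image` segment would be replaced by its visual prefix embeddings before the sequence reaches the language model.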

Key findings

  • Frozen, trained only on image captioning, transfers to VQA without any VQA training and outperforms both the blind baseline and the fine-tuned baseline, reaching 29.5%/35.7%/38.2% with 0/1/4 in-context examples on VQAv2 (Table 1).
  • Few-shot prompts improve VQA performance, approaching but not matching SGD training (e.g., 38.2% with four examples vs 48.4% with full VQA training, Table 1).
  • Performance on OKVQA scales with language model size, indicating encyclopedic knowledge contributes to multimodal reasoning without directly training on OKVQA.
  • Open-Ended miniImageNet results show substantial gains with higher inner-shots and more varied exemplars, demonstrating fast-binding of novel words to visual categories (Table 3).
  • Fast-VQA and Real-Fast-VQA indicate the model can incorporate recently learned words into multimodal questions, with performance improving as inner-shots increase (Table 5).

This review was created by AI and reviewed by human editors.