[Paper Review] Multimodal Few-Shot Learning with Frozen Language Models

Maria Tsimpoukelli, Jacob Menick | arXiv (Cornell University) | 2021-06-25
Multimodal Machine Learning Applications | References: 35 | Citations: 86
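One-Line Summary

A vision encoder is trained to represent each image as a sequence of continuous embeddings that prefix the prompt of a frozen, pre-trained language model, which transfers the language model's few-shot learning ability to vision-language tasks without updating its weights.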

ABSTRACT

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.

Research Motivation and Goals

  • Motivate extending few-shot language model capabilities to multimodal (vision-language) tasks without fine-tuning the language model.
  • Enable rapid adaptation to new multimodal tasks through in-context prompting with interleaved image and text inputs.
  • Show that a frozen language model can leverage encyclopedic knowledge for vision tasks and quickly bind words to visual concepts.
  • Demonstrate few-shot learning capabilities on diverse benchmarks including VQA, OKVQA, and miniImageNet in open-ended generation.

Proposed Method

  • Use a pre-trained 7B autoregressive language model (Transformer) with frozen weights.
  • Train a vision encoder (NF-ResNet-50) to output a sequence of embeddings that form a visual prefix compatible with the language model.
  • Linearly map the vision encoder outputs to D-dimensional embeddings and reshape into n tokens to form the visual prefix.
  • Backpropagate gradients through the frozen language model to train only the vision encoder parameters.
  • Allow interleaving of image embeddings and text embeddings in the prompt, leveraging relative positional encodings for multiple images.
  • Evaluate in an open-ended, generative setting across zero-shot and few-shot scenarios, measuring token-based generation quality against ground-truth.
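
The linear-map-and-reshape step above can be illustrated with a small sketch. This is not the authors' implementation: the toy sizes, the random weights, and the helper names `linear` and `visual_prefix` are all assumptions made for illustration, and plain Python lists stand in for tensors. It only shows how one feature vector could become n prefix tokens of width D that are placed in front of the text embeddings.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def visual_prefix(features, weight, bias, n_tokens, d_model):
    """Map encoder features to n_tokens embeddings, each of width d_model."""
    flat = linear(features, weight, bias)  # length n_tokens * d_model
    assert len(flat) == n_tokens * d_model
    # Reshape the flat vector into n_tokens consecutive chunks.
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]


# Toy sizes (hypothetical; far smaller than any real model).
FEATURE_DIM, N_TOKENS, D_MODEL = 8, 2, 4

rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
          for _ in range(N_TOKENS * D_MODEL)]
bias = [0.0] * (N_TOKENS * D_MODEL)

# Stand-ins for the vision encoder output and three embedded caption tokens.
image_features = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
text_embeddings = [[0.0] * D_MODEL for _ in range(3)]

prefix = visual_prefix(image_features, weight, bias, N_TOKENS, D_MODEL)
lm_input = prefix + text_embeddings  # sequence a frozen LM would consume

print(len(prefix), len(prefix[0]), len(lm_input))  # 2 4 5
```

In a training setup of this kind, only the encoder-side parameters (here `weight` and `bias` as stand-ins) would receive gradient updates, while the language model's own weights stay fixed.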

Experimental Results

Research Questions

  • RQ1: Can a frozen large language model generate appropriate multimodal outputs when conditioned on a visual prefix produced by a trainable vision encoder?
  • RQ2: Does prompting with interleaved sequences of images and text enable zero-shot and few-shot learning on multimodal tasks (VQA, captioning, and category binding)?
  • RQ3: To what extent does the model leverage its encyclopedic knowledge for visual tasks (e.g., OKVQA) without task-specific fine-tuning?
  • RQ4: How does the model perform on fast concept binding tasks (miniImageNet open-ended and real-name variants) under few-shot conditioning?
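
To make the interleaved prompting in RQ2 concrete, the sketch below lays out an n-shot prompt as an alternating sequence of image and text segments followed by the query. The `Image` class, the `build_prompt` helper, the image names, and the "Question: ... Answer:" wording are hypothetical choices for illustration. In a real system each image segment would be replaced by its visual prefix embeddings and each text segment by token embeddings.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Image:
    name: str  # hypothetical identifier standing in for pixel data


Segment = Union[Image, str]


def build_prompt(support: List[Tuple[Image, str]],
                 query_image: Image,
                 query_text: str) -> List[Segment]:
    """Interleave (image, text) support examples, then append the query."""
    prompt: List[Segment] = []
    for image, text in support:
        prompt.append(image)
        prompt.append(text)
    prompt.append(query_image)
    prompt.append(query_text)
    return prompt


# Two support examples (a 2-shot prompt); an empty list gives the zero-shot case.
support = [
    (Image("img_a"), "Question: What colour is the car? Answer: red."),
    (Image("img_b"), "Question: How many dogs are there? Answer: two."),
]
prompt = build_prompt(support, Image("img_query"),
                      "Question: What is on the table? Answer:")

kinds = ["image" if isinstance(s, Image) else "text" for s in prompt]
print(kinds)  # ['image', 'text', 'image', 'text', 'image', 'text']
```

The model would then be asked to continue the sequence after the final "Answer:", so the same layout covers zero-shot (no support pairs) and few-shot use.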

Key Results

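n-shot accuracy (%) | n=0 | n=1 | n=4 | τ
Frozen | 29.5 | 35.7 | 38.2 |
Frozen_scratch | 0.0 | 0.0 | 0.0 |
Frozen_finetuned | 24.0 | 28.2 | 29.2 |
Frozen_train-blind | 26.2 | 33.5 | 33.3 |
Frozen_VQA | 48.4 | - | - |
Frozen_VQA-blind | 39.1 | - | - |
Oscar [23] | 73.8 | - | - |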
  • Zero-shot transfer from image captioning to VQA works: Frozen reaches 29.5/35.7/38.2 at 0/1/4 shots on VQAv2, ahead of both the train-blind baseline (26.2/33.5/33.3) and the finetuned variant (24.0/28.2/29.2) (Table 1).
  • Few-shot prompts improve VQA performance, approaching but not matching SGD training (e.g., 38.2% with four examples vs 48.4% with full VQA training, Table 1).
  • Performance on OKVQA scales with language model size, indicating encyclopedic knowledge contributes to multimodal reasoning without directly training on OKVQA.
  • Open-Ended miniImageNet results show substantial gains with higher inner-shots and more varied exemplars, demonstrating fast-binding of novel words to visual categories (Table 3).
  • Fast-VQA and Real-Fast-VQA indicate the model can incorporate recently learned words into multimodal questions, with performance improving as inner-shots increase (Table 5).
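
The Table 1 comparisons quoted above reduce to a few subtractions. The short sketch below works them out using only the accuracies reported in this review; the variable names are illustrative.

```python
# VQAv2 accuracies (%) for Frozen at 0, 1, and 4 shots, as listed in Table 1.
frozen = {0: 29.5, 1: 35.7, 4: 38.2}
train_blind = {0: 26.2, 1: 33.5, 4: 33.3}
frozen_vqa_full_training = 48.4

# Gain from adding four in-context examples.
few_shot_gain = round(frozen[4] - frozen[0], 1)

# Remaining gap between 4-shot prompting and training on VQA.
gap_to_full_training = round(frozen_vqa_full_training - frozen[4], 1)

# Margin over the blind baseline at each shot count.
margin_over_blind = {n: round(frozen[n] - train_blind[n], 1) for n in frozen}

print(few_shot_gain)         # 8.7
print(gap_to_full_training)  # 10.2
print(margin_over_blind)     # {0: 3.3, 1: 2.2, 4: 4.9}
```

So four examples recover 8.7 points over the zero-shot setting, while a gap of 10.2 points to the VQA-trained model remains.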

This review was generated by AI and checked by a human editor.