QUICK REVIEW

[Paper Review] Multimodal Few-Shot Learning with Frozen Language Models

Maria Tsimpoukelli, Jacob Menick | arXiv (Cornell University) | June 25, 2021

Multimodal Machine Learning Applications | References: 35 | Citations: 86

One-Line Summary

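Frozen trains only a vision encoder so that each image becomes a prefix of continuous embeddings for a frozen, pre-trained language model. The resulting system picks up new vision-language tasks, such as visual question answering and binding new words to visual categories, from a few interleaved image and text examples in the prompt.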

ABSTRACT

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.

Research Motivation and Goals

Extend the few-shot learning ability of language models to multimodal (vision-language) tasks without fine-tuning the language model.
Enable rapid adaptation to new multimodal tasks through in-context prompting with interleaved image and text inputs.
Show that a frozen language model can draw on its encyclopedic knowledge for vision tasks and quickly bind new words to visual concepts.
Demonstrate few-shot learning on diverse benchmarks, including VQA, OKVQA, and miniImageNet, in an open-ended generation setting.

Proposed Method

Use a pre-trained 7B autoregressive language model (Transformer) with frozen weights.
Train a vision encoder (NF-ResNet-50) to output a sequence of embeddings that form a visual prefix compatible with the language model.
Linearly map the vision encoder output to D·n channels and reshape the result into a sequence of n embeddings, each of dimension D (the language model's embedding width), to form the visual prefix.
Backpropagate gradients through the frozen language model to train only the vision encoder parameters.
Allow interleaving of image embeddings and text embeddings in the prompt, leveraging relative positional encodings for multiple images.
Evaluate in an open-ended, generative setting across zero-shot and few-shot scenarios, scoring the generated tokens against the ground truth.
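
The prefix construction and prompt interleaving described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the random weights, the whitespace "tokenizer", and the helper names (`linear`, `visual_prefix`, `embed_text`, `embed_image`, `build_prompt`) are all hypothetical, and neither the vision encoder nor the language model is actually run. It only shows how a feature vector becomes n embeddings of width D and how image and text embeddings are concatenated into one sequence.

```python
import random

# Toy sizes, chosen only for illustration (hypothetical, not from the paper).
FEATURE_DIM = 6   # size of the pooled vision-encoder output
N_TOKENS = 2      # n: number of prefix embeddings per image
D_MODEL = 4       # D: embedding width of the language model

rng = random.Random(0)


def linear(x, weight, bias):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i
            for row, b_i in zip(weight, bias)]


def visual_prefix(features, weight, bias, n_tokens, d_model):
    """Map image features to n_tokens * d_model channels, then reshape."""
    flat = linear(features, weight, bias)
    if len(flat) != n_tokens * d_model:
        raise ValueError("projection must output n_tokens * d_model channels")
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]


# Trainable projection (random here; it would be learned in practice).
W = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
     for _ in range(N_TOKENS * D_MODEL)]
b = [0.0] * (N_TOKENS * D_MODEL)

# Stand-in for the frozen LM's embedding table: one vector per token.
VOCAB = {}


def embed_text(text):
    """Embed whitespace-separated tokens with a fixed lookup table."""
    out = []
    for tok in text.split():
        if tok not in VOCAB:
            VOCAB[tok] = [rng.uniform(-1, 1) for _ in range(D_MODEL)]
        out.append(VOCAB[tok])
    return out


def embed_image(features):
    """Embed one image as a visual prefix of N_TOKENS embeddings."""
    return visual_prefix(features, W, b, N_TOKENS, D_MODEL)


def build_prompt(segments):
    """Concatenate embeddings of interleaved image and text segments."""
    sequence = []
    for kind, payload in segments:
        if kind == "image":
            sequence.extend(embed_image(payload))
        elif kind == "text":
            sequence.extend(embed_text(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


# Placeholder image features; a real system would use encoder outputs.
image_a = [rng.random() for _ in range(FEATURE_DIM)]
image_b = [rng.random() for _ in range(FEATURE_DIM)]

prefix = embed_image(image_a)
prompt = build_prompt([
    ("image", image_a), ("text", "This is a dax."),
    ("image", image_b), ("text", "Question: What is this? Answer:"),
])
print(len(prompt), "embeddings of width", len(prompt[0]))
```

In a real system the concatenated sequence would be fed to the frozen language model, and only the vision-side parameters (here `W` and `b`, plus the encoder itself) would receive gradient updates.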

Experimental Results

Research Questions

RQ1: Can a frozen large language model generate appropriate multimodal outputs when conditioned on a visual prefix produced by a trainable vision encoder?
RQ2: Does prompting with interleaved sequences of images and text enable zero-shot and few-shot learning on multimodal tasks (VQA, captioning, and category binding)?
RQ3: To what extent does the model leverage its encyclopedic knowledge for visual tasks (e.g., OKVQA) without task-specific fine-tuning?
RQ4: How does the model perform on fast concept binding tasks (miniImageNet open-ended and real-name variants) under few-shot conditioning?

Key Results

n-shot accuracy (%)	n=0	n=1	n=4	τ (uses VQAv2 training data)
Frozen	29.5	35.7	38.2	✗
Frozen_scratch	0.0	0.0	0.0	✗
Frozen_finetuned	24.0	28.2	29.2	✗
Frozen_train-blind	26.2	33.5	33.3	✗
Frozen_VQA	48.4	–	–	✓
Frozen_VQA-blind	39.1	–	–	✓
Oscar [23]	73.8	–	–	✓

Zero-shot transfer from image captioning to VQAv2 reaches 29.5%, ahead of both the blind baseline (Frozen_train-blind, 26.2%) and the variant with a fine-tuned language model (Frozen_finetuned, 24.0%); Frozen scores 29.5/35.7/38.2 at 0/1/4 shots (Table 1).
Few-shot prompts improve VQA performance, approaching but not matching SGD training (e.g., 38.2% with four examples vs 48.4% with full VQA training, Table 1).
Performance on OKVQA scales with language model size, indicating encyclopedic knowledge contributes to multimodal reasoning without directly training on OKVQA.
Open-Ended miniImageNet results show substantial gains with more inner-shots and more varied exemplars, demonstrating fast binding of novel words to visual categories (Table 3).
Fast-VQA and Real-Fast-VQA indicate the model can incorporate recently learned words into multimodal questions, with performance improving as inner-shots increase (Table 5).
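
As a quick check on the Table 1 comparisons above, the differences can be computed directly from the reported accuracies. The snippet below is only arithmetic on numbers copied from the table; the dictionary layout and variable names are illustrative choices, not part of the paper.

```python
# VQAv2 accuracies (%) copied from Table 1, keyed by number of shots.
# Rows without few-shot entries only carry the n=0 value.
table1 = {
    "Frozen": {0: 29.5, 1: 35.7, 4: 38.2},
    "Frozen_train-blind": {0: 26.2, 1: 33.5, 4: 33.3},
    "Frozen_VQA": {0: 48.4},
}


def diff(a, b):
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


# Gain from adding four in-context examples to the prompt.
few_shot_gain = diff(table1["Frozen"][4], table1["Frozen"][0])

# Remaining gap to the model trained on VQAv2 data.
gap_to_trained = diff(table1["Frozen_VQA"][0], table1["Frozen"][4])

# Advantage over the blind baseline at four shots.
vision_benefit = diff(table1["Frozen"][4], table1["Frozen_train-blind"][4])

print(f"0 -> 4 shots: +{few_shot_gain} points")
print(f"gap to VQA-trained model: {gap_to_trained} points")
print(f"over blind baseline at n=4: +{vision_benefit} points")
```

Four in-context examples recover 8.7 points, which still leaves a 10.2-point gap to the model trained on VQAv2 data, while access to the image is worth 4.9 points over the blind baseline at n=4.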

This review was generated by AI and reviewed by a human editor.