[Paper Review] Multimodal Few-Shot Learning with Frozen Language Models
Summary
When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
Research Motivation and Goals
- Extend the few-shot capabilities of language models to multimodal (vision-language) tasks without fine-tuning the language model.
- Enable rapid adaptation to new multimodal tasks through in-context prompting with interleaved image and text inputs.
- Show that a frozen language model can draw on its encyclopedic knowledge for vision tasks and quickly bind words to visual concepts.
- Demonstrate few-shot learning capabilities on diverse benchmarks including VQA, OKVQA, and miniImageNet in open-ended generation.
Proposed Method
- Use a pre-trained 7B autoregressive language model (Transformer) with frozen weights.
- Train a vision encoder (NF-ResNet-50) to output a sequence of embeddings that form a visual prefix compatible with the language model.
- Linearly map the vision encoder outputs to D-dimensional embeddings and reshape into n tokens to form the visual prefix.
- Backpropagate gradients through the frozen language model to train only the vision encoder parameters.
- Allow interleaving of image embeddings and text embeddings in the prompt, leveraging relative positional encodings for multiple images.
- Evaluate in an open-ended, generative setting across zero-shot and few-shot scenarios, measuring token-based generation quality against ground-truth.
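As a rough illustration of the visual-prefix step described above, the following pure-Python sketch maps a pooled image feature vector through a linear layer, reshapes the result into n token embeddings of width D, and prepends them to text embeddings. The sizes (8-dimensional features, n = 2, D = 4), the random weights, and the helper names `linear` and `visual_prefix` are assumptions chosen for readability, not values or code from the paper.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def visual_prefix(features, weight, bias, n_tokens, d_model):
    """Map image features to n_tokens embeddings, each of size d_model."""
    flat = linear(features, weight, bias)  # length n_tokens * d_model
    if len(flat) != n_tokens * d_model:
        raise ValueError("linear output must have n_tokens * d_model entries")
    # Reshape the flat vector into n_tokens consecutive chunks.
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]


# Toy sizes (assumptions for illustration only).
FEATURE_DIM, N_TOKENS, D_MODEL = 8, 2, 4
rng = random.Random(0)

# Stand-in for the trainable linear map on top of the vision encoder.
weight = [[rng.gauss(0.0, 0.1) for _ in range(FEATURE_DIM)]
          for _ in range(N_TOKENS * D_MODEL)]
bias = [0.0] * (N_TOKENS * D_MODEL)

# Stand-in for pooled vision-encoder features of one image.
image_features = [rng.random() for _ in range(FEATURE_DIM)]
prefix = visual_prefix(image_features, weight, bias, N_TOKENS, D_MODEL)

# Stand-in for the frozen language model's embeddings of three text tokens.
text_embeddings = [[rng.random() for _ in range(D_MODEL)] for _ in range(3)]

# The language model would consume the prefix followed by the text.
lm_input = prefix + text_embeddings
print(len(prefix), len(prefix[0]), len(lm_input))  # 2 4 5
```

In a training setup along these lines, only the vision-side parameters (here `weight` and `bias`) would be updated, while the language model that consumes `lm_input` stays fixed.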
Experimental Results
Research Questions
- RQ1: Can a frozen large language model generate appropriate multimodal outputs when conditioned on a visual prefix produced by a trainable vision encoder?
- RQ2: Does prompting with interleaved sequences of images and text enable zero-shot and few-shot learning on multimodal tasks (VQA, captioning, and category binding)?
- RQ3: To what extent does the model leverage its encyclopedic knowledge for visual tasks (e.g., OKVQA) without task-specific fine-tuning?
- RQ4: How does the model perform on fast concept binding tasks (miniImageNet open-ended and real-name variants) under few-shot conditioning?
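The interleaved prompting that these questions probe can be pictured as an ordered list of image and text segments. The sketch below is one hypothetical way to lay out such a prompt; the function name `build_few_shot_prompt`, the file names, and the "Question: ... Answer: ..." wording are assumptions for illustration rather than the paper's exact format.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (kind, payload), where kind is "image" or "text"


def build_few_shot_prompt(
    support: List[Tuple[str, str]],
    query_image: str,
    query_text: str,
) -> List[Segment]:
    """Interleave (image, text) support examples, then append the query."""
    segments: List[Segment] = []
    for image_id, text in support:
        segments.append(("image", image_id))  # becomes a visual prefix
        segments.append(("text", text))       # becomes text embeddings
    segments.append(("image", query_image))
    segments.append(("text", query_text))     # the model continues from here
    return segments


# Two support examples (a 2-shot prompt); all payloads are made up.
support_examples = [
    ("car.jpg", "Question: What colour is the car? Answer: blue"),
    ("dog.jpg", "Question: What animal is this? Answer: dog"),
]
prompt = build_few_shot_prompt(
    support_examples,
    "query.jpg",
    "Question: What is on the table? Answer:",
)
for kind, payload in prompt:
    print(kind, "|", payload)
```

Passing an empty `support` list yields the zero-shot case: a single image followed by the question text.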
Key Results
| n-shot accuracy (%) | n=0 | n=1 | n=4 | τ |
|---|---|---|---|---|
| Frozen | 29.5 | 35.7 | 38.2 | ✗ |
| Frozen_scratch | 0.0 | 0.0 | 0.0 | ✗ |
| Frozen_finetuned | 24.0 | 28.2 | 29.2 | ✗ |
| Frozen_train-blind | 26.2 | 33.5 | 33.3 | ✗ |
| Frozen_VQA | 48.4 | – | – | ✓ |
| Frozen_VQA-blind | 39.1 | – | – | ✓ |
| Oscar [23] | 73.8 | – | – | ✓ |
- Zero-shot transfer from image captioning to VQA outperforms both the blind baseline (Frozen_train-blind, 26.2) and the fine-tuned variant (Frozen_finetuned, 24.0), with Frozen achieving 29.5/35.7/38.2 at 0/1/4 shots on VQAv2 (Table 1).
- Few-shot prompts improve VQA performance, approaching but not matching SGD training (e.g., 38.2% with four examples vs 48.4% with full VQA training, Table 1).
- Performance on OKVQA scales with language model size, indicating encyclopedic knowledge contributes to multimodal reasoning without directly training on OKVQA.
- Open-Ended miniImageNet results show substantial gains with higher inner-shots and more varied exemplars, demonstrating fast-binding of novel words to visual categories (Table 3).
- Fast-VQA and Real-Fast-VQA indicate the model can incorporate recently learned words into multimodal questions, with performance improving as inner-shots increase (Table 5).
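The Table 1 comparisons cited above can be checked with a few lines of arithmetic. The snippet below copies the accuracies from the table as shown in this review; the variable names are illustrative only.

```python
# VQAv2 accuracy (%) by number of shots, copied from Table 1 above.
table1 = {
    "Frozen": {0: 29.5, 1: 35.7, 4: 38.2},
    "Frozen_finetuned": {0: 24.0, 1: 28.2, 4: 29.2},
    "Frozen_train-blind": {0: 26.2, 1: 33.5, 4: 33.3},
}
frozen_vqa_full = 48.4  # Frozen_VQA, trained on VQAv2 data

# Gain from adding four in-context examples to the zero-shot prompt.
few_shot_gain = table1["Frozen"][4] - table1["Frozen"][0]

# Remaining gap between 4-shot prompting and training on VQAv2.
gap_to_training = frozen_vqa_full - table1["Frozen"][4]

# Margin of Frozen over the blind baseline at each shot count.
margin_over_blind = {
    n: table1["Frozen"][n] - table1["Frozen_train-blind"][n]
    for n in (0, 1, 4)
}

print(f"0 -> 4 shot gain: {few_shot_gain:.1f} points")        # 8.7
print(f"gap to VQA training: {gap_to_training:.1f} points")   # 10.2
for n, margin in margin_over_blind.items():
    print(f"margin over blind at n={n}: {margin:.1f} points")  # 3.3, 2.2, 4.9
```

On these numbers, four in-context examples recover a little under half of the distance between zero-shot prompting and training on VQAv2.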
This review was generated by AI and reviewed by a human editor.