QUICK REVIEW

[Paper Review] Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue|arXiv (Cornell University)|2022. 04. 29.

Multimodal Machine Learning Applications | Citations: 1,238

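One-Line Summary

Flamingo conditions a frozen large language model on visual inputs interleaved with text, using a Perceiver-based visual resampler and gated cross-attention. This gives it strong few-shot performance across a wide range of image, video, and language tasks and allows open-ended generation without task-specific fine-tuning.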

ABSTRACT

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

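Motivation and Objectives

Enable rapid adaptation to new multimodal tasks from a minimal amount of annotated data.
Bridge pretrained vision-only and language-only models, and handle interleaved visual and textual data.
Enable open-ended language generation conditioned on images or videos.
Evaluate few-shot performance across diverse vision-language benchmarks and analyze the design choices.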

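Proposed Method

Use a frozen large language model (Chinchilla) as the backbone and insert trainable cross-attention blocks that condition it on visual inputs.
Represent images and videos with a Perceiver Resampler, which produces a fixed number of visual tokens from feature maps of variable size.
Interleave text and visual tokens in the prompt so that the model predicts the next text token conditioned on the preceding text and the preceding visual inputs.
Train on a mixture of web-scraped vision-language data (interleaved text and images from HTML pages, image-text pairs, and video-text pairs) to support in-context learning.
Introduce a tanh-gated cross-attention mechanism that fuses visual information while preserving the language model's weights and training stability.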
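The two architectural pieces above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the attention here is single-head with no learned projections, and the real modules also include feed-forward layers and multiple stacked blocks. All sizes, the random features, and the helper names (`cross_attend`, `resample`, `gated_cross_attention`, `random_matrix`) are assumptions chosen for illustration. The sketch shows only two properties: a fixed set of latent queries yields the same number of visual tokens regardless of input size, and a tanh gate at zero leaves the text states untouched.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Single-head dot-product attention with no learned projections.

    Every query returns a weighted average of the `memory` vectors.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        outputs.append(
            [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]
        )
    return outputs


def resample(visual_features, latents):
    """A fixed set of latent queries attends to any number of visual features,
    so the output always has len(latents) tokens."""
    return cross_attend(latents, visual_features)


def gated_cross_attention(text_states, visual_tokens, alpha):
    """Residual cross-attention scaled by tanh(alpha).

    At alpha = 0 the gate is 0 and the text states pass through unchanged.
    """
    gate = math.tanh(alpha)
    attended = cross_attend(text_states, visual_tokens)
    return [
        [t + gate * a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(text_states, attended)
    ]


def random_matrix(rng, rows, dim):
    """Hypothetical stand-in for real encoder features or learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
latents = random_matrix(rng, 4, 8)            # 4 latent queries (illustrative size)
image_features = random_matrix(rng, 49, 8)    # e.g. one flattened feature map
video_features = random_matrix(rng, 196, 8)   # e.g. several frames, flattened
text_states = random_matrix(rng, 5, 8)

image_tokens = resample(image_features, latents)
video_tokens = resample(video_features, latents)
print(len(image_tokens), len(video_tokens))   # both equal len(latents)

closed = gated_cross_attention(text_states, image_tokens, alpha=0.0)
print(closed == text_states)                  # gate closed: text states unchanged
```

A gate that starts closed means the combined model initially reproduces the frozen language model's outputs, and visual information is mixed in only as `alpha` moves away from zero during training. This is one way to read the stability claim in the last item above.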

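Experimental Results

Research Questions

RQ1: Can a vision-language model perform diverse multimodal tasks in the few-shot setting without task-specific fine-tuning?
RQ2: Which architectural components are most effective for conditioning a frozen LM on interleaved visual inputs (images and videos) of variable length?
RQ3: How does training on a mixture of interleaved and paired vision-language data affect generalization and few-shot adaptation?
RQ4: How well can in-context prompting with a few examples drive open-ended tasks such as captioning and visual question answering?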
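To make the in-context prompting in RQ4 concrete, the following sketch assembles a few-shot visual question-answering prompt by interleaving image placeholders with text. The placeholder strings `<image>` and `<EOC>`, the `Question: ... Answer: ...` template, and the `Shot` and `build_vqa_prompt` names are illustrative assumptions, not a documented interface; the image paths are hypothetical.

```python
from dataclasses import dataclass

IMAGE_TAG = "<image>"    # assumed marker for the position of an image in the text
END_OF_CHUNK = "<EOC>"   # assumed separator that closes one support example


@dataclass
class Shot:
    """One annotated support example for the prompt."""
    image_path: str
    question: str
    answer: str


def build_vqa_prompt(shots, query_image, query_question):
    """Return (prompt_text, image_paths) for a few-shot VQA prompt.

    Images are listed in the order their placeholders appear in the text,
    and the prompt ends right where the model should begin generating.
    """
    parts = []
    images = []
    for shot in shots:
        parts.append(
            f"{IMAGE_TAG}Question: {shot.question} Answer: {shot.answer}{END_OF_CHUNK}"
        )
        images.append(shot.image_path)
    parts.append(f"{IMAGE_TAG}Question: {query_question} Answer:")
    images.append(query_image)
    return "".join(parts), images


shots = [
    Shot("a.jpg", "What animal is this?", "a cat"),
    Shot("b.jpg", "What color is the car?", "red"),
]
prompt, images = build_vqa_prompt(shots, "c.jpg", "How many people are there?")
print(prompt)
print(images)
```

Because the task is specified entirely by the support examples in the prompt, switching from question answering to captioning would only change the text template, not the model weights.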

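Key Results

Flamingo sets a new state of the art in few-shot learning on 16 multimodal tasks.
On six tasks, Flamingo matches or surpasses the fine-tuned state of the art using only 32 task-specific examples.
Few-shot performance improves with both model size and number of shots, and larger models make better use of additional shots.
The architecture, with gated cross-attention and the Perceiver Resampler, conditions the frozen LM on interleaved visual inputs while keeping training stable.
Fine-tuning Flamingo on more data sets a new state of the art on several tasks (VQAv2, VATEX, VizWiz, MSRVTTQA, HatefulMemes).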

This review was generated by AI and reviewed by a human editor.