QUICK REVIEW

[Paper Review] An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Zhengyuan Yang, Zhe Gan|arXiv (Cornell University)|2021. 09. 10.

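Multimodal Machine Learning Applications | Citations: 46

One-Line Summary

This paper introduces PICa, a prompting-based method that uses image captions/tags and few-shot in-context learning to apply GPT-3 to knowledge-based VQA, achieving few-shot state-of-the-art performance on OK-VQA without fine-tuning.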

ABSTRACT

Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.

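Research Motivation and Goals

Motivate a simple, fine-tuning-free approach to knowledge-based VQA that leverages GPT-3's implicit knowledge and reasoning ability.
Use GPT-3 as an implicit knowledge base, accessed through textual image representations, to avoid the mismatch risk of explicit knowledge retrieval.
Systematically study how textual image representations, in-context example selection, and multi-query ensembling affect few-shot VQA performance.
Demonstrate the potential and limitations of GPT-3 on multimodal tasks through rigorous ablation and qualitative analysis.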

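Proposed Method

Convert the image into a textual description (captions or tags) and feed it to GPT-3.
Prompt GPT-3 in a few-shot setting using a prompt head and a small number of in-context VQA examples.
Improve performance through the choice of image representation (e.g., captions vs. tags) and multi-query ensembling across prompts.
Select in-context examples by similarity (CLIP/RoBERTa) to the relevant question/image and, when needed, combine multiple answers through multi-query ensembling.
Generate answers without fine-tuning, relying on GPT-3's few-shot capability and open-ended text generation.
Provide ablation analyses to understand how text representation, in-context example selection, and multi-query ensembling affect performance.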
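
The pipeline described above (textual image context, similarity-based example selection, prompt assembly, multi-query ensembling) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Example` class, the prompt wording and `===` separator, and the function names are all hypothetical; the embeddings stand in for CLIP/RoBERTa features that would be computed elsewhere; and picking the candidate with the highest total log-probability is an assumed aggregation rule, since this review does not specify one. No GPT-3 call is made here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Example:
    """One in-context VQA example (hypothetical container)."""
    context: str                 # caption, optionally followed by tags
    question: str
    answer: str
    embedding: Sequence[float]   # assumed precomputed (e.g. CLIP/RoBERTa features)


# Illustrative prompt head; the exact wording is an assumption.
PROMPT_HEAD = "Please answer the question according to the context."


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_examples(query_emb: Sequence[float],
                    pool: List[Example], n: int) -> List[Example]:
    """Return the n pool examples most similar to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_emb, ex.embedding),
                    reverse=True)
    return ranked[:n]


def build_prompt(examples: List[Example], context: str, question: str) -> str:
    """Prompt head, then n solved examples, then the unanswered test question."""
    blocks = [PROMPT_HEAD]
    for ex in examples:
        blocks.append(f"Context: {ex.context}\nQ: {ex.question}\nA: {ex.answer}")
    blocks.append(f"Context: {context}\nQ: {question}\nA:")
    return "\n===\n".join(blocks)


def ensemble(candidates: List[Tuple[str, float]]) -> str:
    """Pick one answer from several queries.

    Each candidate is (answer, total_log_probability). Choosing the
    highest-scoring answer is an assumed rule for this sketch.
    """
    return max(candidates, key=lambda c: c[1])[0]


if __name__ == "__main__":
    pool = [
        Example("a dog on a beach", "what animal is this?", "dog", [1.0, 0.0]),
        Example("a train at a station", "what vehicle is this?", "train", [0.0, 1.0]),
        Example("a cat on a sofa", "what animal is this?", "cat", [0.9, 0.1]),
    ]
    shots = select_examples([1.0, 0.0], pool, n=2)
    print(build_prompt(shots, "a bird on a branch", "what animal is this?"))
```

In a full system, each of several prompts (built from different example subsets) would be sent to the language model, and the returned answers with their log-probabilities would be passed to `ensemble`.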

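Experimental Results

Research Questions

RQ1: Can GPT-3, given textual image descriptions, serve as an unstructured knowledge base for visually grounded reasoning?
RQ2: How does the textual image representation (captions, tags, or a combination) affect few-shot knowledge-based VQA performance?
RQ3: Do in-context example selection and multi-query ensembling substantially improve GPT-3-based VQA in the few-shot regime?
RQ4: What are the limitations of GPT-3-based few-shot VQA on standard benchmarks such as OK-VQA and VQAv2?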

Key Results

Method	Image Repr.	n=0	n=1	n=4	n=8	n=16	Example engineering
Frozen (Tsimpoukelli et al. 2021)	Feature Emb.	5.9	9.7	12.6	-	-	✗
PICa-Base	Caption	17.5	32.4	37.6	39.6	42.0	✗
PICa-Base	Caption+Tags	16.4	34.0	39.7	41.8	43.3	✗
PICa-Full	Caption	17.7	40.3	44.8	46.1	46.9	✓
PICa-Full	Caption+Tags	17.1	40.8	45.4	46.8	48.0	✓

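In the few-shot setting, PICa surpasses the supervised state of the art on OK-VQA, reaching 48.0% with 16 in-context examples and the caption+tags representation.
Even with captions alone, PICa-Full reaches 46.9% on OK-VQA with 16 shots, outperforming previous supervised methods.
Representing the image as a rich textual description (captions, tags, or both) substantially outperforms a question-only blind baseline.
In-context example selection and multi-query ensembling consistently improve OK-VQA performance; with an oracle-like selection of ideal examples, accuracy approaches 49.1% on OK-VQA.
On VQAv2, PICa-Full achieves 56.1% in the few-shot setting with caption+tags, significantly better than Frozen and other existing baselines, but still short of the 73.8% of the supervised Oscar model.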
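
To make the size of these effects concrete, the per-shot differences can be read directly off the results table. The short sketch below only re-derives differences from the numbers already listed; the variable and function names are illustrative, and the PICa-Base vs. PICa-Full difference is interpreted as the effect of the "Example engineering" column.

```python
# OK-VQA accuracy (%) copied from the results table; one value per shot count.
SHOTS = (0, 1, 4, 8, 16)
ACC = {
    ("PICa-Base", "Caption"):      (17.5, 32.4, 37.6, 39.6, 42.0),
    ("PICa-Base", "Caption+Tags"): (16.4, 34.0, 39.7, 41.8, 43.3),
    ("PICa-Full", "Caption"):      (17.7, 40.3, 44.8, 46.1, 46.9),
    ("PICa-Full", "Caption+Tags"): (17.1, 40.8, 45.4, 46.8, 48.0),
}


def gain(baseline, improved):
    """Per-shot accuracy difference (improved - baseline), in points."""
    return [round(b - a, 1) for a, b in zip(ACC[baseline], ACC[improved])]


# Effect of example engineering (Base -> Full) with caption+tags.
engineering_gain = gain(("PICa-Base", "Caption+Tags"), ("PICa-Full", "Caption+Tags"))

# Effect of adding tags to captions for PICa-Full.
tag_gain = gain(("PICa-Full", "Caption"), ("PICa-Full", "Caption+Tags"))

if __name__ == "__main__":
    for n, e, t in zip(SHOTS, engineering_gain, tag_gain):
        print(f"n={n:>2}: example engineering {e:+.1f}, tags {t:+.1f}")
```

Read this way, example engineering adds between 4.7 and 6.8 points for n ≥ 1 with caption+tags, while adding tags to captions contributes roughly 0.5 to 1.1 points for PICa-Full at n ≥ 1 and slightly lowers the zero-shot score.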


This review was generated by AI and reviewed by a human editor.