QUICK REVIEW

[Paper Review] REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Yuanze Lin, Yujia Xie|arXiv (Cornell University)|2022. 06. 02.

Multimodal Machine Learning Applications | Citations: 44

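One-Line Summary

REVIVE improves knowledge-based VQA by introducing region-based visual representations, which strengthen both knowledge retrieval and answer generation, and it achieves state-of-the-art accuracy on OK-VQA. It fuses the modalities using object-centric regions, explicit and implicit knowledge, and a FiD-based encoder-decoder.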

ABSTRACT

This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that using regional information in a better way can significantly improve the performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA even though these two tasks share the common spirit, i.e., rely on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual features are extracted either from the whole image or in a sliding window manner for retrieving knowledge, and the important relationship within/among object regions is neglected; 2) visual features are not well utilized in the final answering model, which is counter-intuitive to some extent. Based on these observations, we propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions not only in the knowledge retrieval stage but also in the answering model. The key motivation is that object regions and inherent relationship are important for knowledge-based VQA. We perform extensive experiments on the standard OK-VQA dataset and achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin (+3.6%). We also conduct detailed analysis and show the necessity of regional information in different framework components for knowledge-based VQA. Code is publicly available at https://github.com/yzleroy/REVIVE.

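Research Motivation and Goals

Aim: improve visual representations for knowledge-based VQA by emphasizing object-centric regional information.
Research goal: systematically study how region-based features affect knowledge retrieval and final answer generation.
Proposal: integrate regional features, explicit and implicit knowledge, and a transformer-based answering model into REVIVE.
Demonstrate state-of-the-art performance on the OK-VQA dataset and analyze the contribution of each component.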

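Proposed Method

Detect object regions with GLIP and extract region-based visual features.
Describe each region with its top-ranked regional tags, selected via CLIP, and generate context with a captioning model (VinVL).
Retrieve explicit knowledge from a knowledge base (Wikidata) using region-based text descriptions and CLIP matching.
Query GPT-3 with region-aware prompts to obtain implicit knowledge and explanations.
Use FiD to encode the explicit and implicit knowledge, the regional visual features, and the context-aware question, then decode the answer.
Fuse the regional features and the retrieved knowledge in the FiD-based encoder-decoder to generate the answer.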
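
The retrieval and fusion steps above can be illustrated with a minimal, text-only sketch. It is not the authors' implementation: the vectors are toy stand-ins for CLIP embeddings, the function names (`cosine`, `top_k_entries`, `build_passages`) and the passage template are hypothetical, and scoring each knowledge entry by its best-matching region is an assumption made for illustration. Regional visual features, which REVIVE also feeds to the answering model, are left out here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_entries(region_vecs, entry_vecs, k):
    """Rank knowledge entries by their best similarity to any region.

    Scoring an entry by its maximum over regions is an assumption made
    for this sketch, not a detail taken from the paper.
    """
    scores = [max(cosine(r, e) for r in region_vecs) for e in entry_vecs]
    order = sorted(range(len(entry_vecs)), key=lambda i: -scores[i])
    return order[:k]

def build_passages(question, context, explicit, implicit):
    """Pair the question with each knowledge item, one passage per item."""
    prefix = f"question: {question} context: {context}"
    passages = [f"{prefix} knowledge: {fact}" for fact in explicit]
    passages += [
        f"{prefix} candidate: {ans} evidence: {why}" for ans, why in implicit
    ]
    return passages

# Toy stand-ins for CLIP embeddings of two regions and three
# knowledge-base entries (hypothetical values).
region_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
entry_texts = [
    "a skateboard is a board with wheels",
    "a pier is a raised walkway over water",
    "a helmet is protective headgear",
]
entry_vecs = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.8, 0.1]]

selected = top_k_entries(region_vecs, entry_vecs, k=2)
explicit = [entry_texts[i] for i in selected]
# Hypothetical (answer candidate, evidence) pair standing in for GPT-3 output.
implicit = [("helmet", "riders wear helmets for safety")]

passages = build_passages(
    "What protects the rider's head?",
    "a person riding a skateboard",
    explicit,
    implicit,
)
for p in passages:
    print(p)
```

In a Fusion-in-Decoder model, each such passage is encoded independently and the decoder attends over the concatenated encoder outputs, which is why the question is repeated in every passage.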

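Experimental Results

Research Questions

RQ1: Do region-based visual representations improve knowledge-based VQA performance compared with whole-image or sliding-window approaches?
RQ2: How does explicit and implicit knowledge retrieved with regional information contribute to answer accuracy?
RQ3: How do regional tags, the number of regions, and positional coordinates affect model performance?
RQ4: Can a FiD-based architecture effectively integrate region-level visual features and external knowledge to generate answers?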

Main Results

Method	Knowledge Resources	Accuracy (%)
Q only	-	14.9
MLP	-	20.7
BAN	-	25.1
BAN+AN	Wikipedia	25.6
MUTAN	-	26.4
BAN+KG-AUG	Wikipedia+ConceptNet	26.7
MUTAN+AN	Wikipedia	27.8
ConceptBERT	ConceptNet	33.7
KRISP	Wikipedia+ConceptNet	38.4
Visual Retriever-Reader	Google Search	39.2
MAVEx	Wikipedia+ConceptNet+Google Images	39.4
PICa-Base	Frozen GPT-3 (175B)	43.3
PICa-Full	Frozen GPT-3 (175B)	48.0
KAT (Single)	Wikidata+Frozen GPT-3 (175B)	53.1
KAT (Ensemble)	Wikidata+Frozen GPT-3 (175B)	54.4
REVIVE (Single)	Wikidata+Frozen GPT-3 (175B)	56.6
REVIVE (Ensemble)	Wikidata+Frozen GPT-3 (175B)	58.0

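With an ensemble, REVIVE achieves 58.0% accuracy on OK-VQA, surpassing the previous state of the art (54.4% for the KAT ensemble) by 3.6 points.
The single-model REVIVE reaches 56.6% accuracy, surpassing the previous best single model (KAT single, 53.1%) by 3.5 points.
Region-based knowledge retrieval outperforms whole-image and sliding-window retrieval by a small margin, on the order of tenths of a percentage point.
In the ablations, performance peaks with 30 regional tags and 36 region proposals.
Adding positional coordinates and region-centric descriptions consistently improves accuracy across components.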
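
The margins over KAT follow directly from the table. A short check, using only the accuracies reported above (the `results` dictionary and the `margin` helper are names introduced here for illustration):

```python
# Accuracies (%) copied from the results table above.
results = {
    "KAT (Single)": 53.1,
    "KAT (Ensemble)": 54.4,
    "REVIVE (Single)": 56.6,
    "REVIVE (Ensemble)": 58.0,
}

def margin(new, old):
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(results[new] - results[old], 1)

ensemble_gain = margin("REVIVE (Ensemble)", "KAT (Ensemble)")
single_gain = margin("REVIVE (Single)", "KAT (Single)")
print(ensemble_gain, single_gain)  # 3.6 3.5
```

The ensemble gain matches the +3.6% stated in the abstract.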

This review was generated by AI and reviewed by a human editor.