QUICK REVIEW

[Paper Review] A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering

Yunxin Li, Longyue Wang | arXiv (Cornell University) | November 13, 2023

Multimodal Machine Learning Applications | Citations: 8

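One-Line Summary

This study benchmarks GPT-4V on knowledge-intensive VQA spanning commonsense knowledge, fine-grained world knowledge, and decision-making rationales. GPT-4V achieves state-of-the-art performance, but the study identifies two weaknesses: hallucination on world knowledge and dependence on visual cues.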

ABSTRACT

The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA). Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate not just recognition of visual elements, but also a deep comprehension of the visual information in conjunction with a vast repository of learned knowledge. To uncover such capabilities of MLMs, particularly the newly introduced GPT-4V and Gemini, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect to general knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in reasoning out specific knowledge from images, showcasing their proficiency across various specialized fields; 3) Comprehensive Knowledge with Decision-making Rationales, which examines model's capability to provide logical explanations for its inference, facilitating a deeper analysis from the interpretability perspective. Additionally, we utilize a visual knowledge-enhanced training strategy and multimodal retrieval-augmented generation approach to enhance MLMs, highlighting the future need for advancements in this research direction. Extensive experiments indicate that: a) GPT-4V demonstrates enhanced explanation generation when using composite images as few-shots; b) GPT-4V and other MLMs produce severe hallucinations when dealing with world knowledge; c) Visual knowledge enhanced training and prompting technicals present potential to improve performance. Codes: https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper

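Research Motivation and Objectives

Evaluate GPT-4V and other MLMs on commonsense-knowledge VQA using OK-VQA-derived prompts and composite-image prompting.
Evaluate GPT-4V on fine-grained world knowledge using INFOSEEK-derived samples that span multiple domains.
Investigate how well GPT-4V generates decision-making rationales, using the rationales in A-OKVQA.
Compare few-shot and zero-shot prompting strategies, and analyze reasoning ability and interpretability.
Identify common failure modes and draw implications for improving knowledge-based multimodal models.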

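Proposed Method

Build a knowledge-intensive VQA benchmark by restructuring the OK-VQA, INFOSEEK, and A-OKVQA datasets into evaluation subsets.
Score short answers with exact match; for rationales, combine automatic metrics (BLEU, CIDEr, METEOR) with human evaluation.
Apply composite-image in-context prompting to GPT-4V, supplying reference samples that cover commonsense, physical, world, and visual knowledge.
Evaluate open-source MLMs and GPT-4V in zero-shot and few-shot settings.
Analyze reasoning through decision-making rationales, and assess interpretability with human judgments of consistency, sufficiency, and factual accuracy.
Examine the effect of in-context reference examples embedded in composite images and the efficiency of this prompting method.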
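
The exact-match scoring of short answers mentioned above can be illustrated with a minimal sketch. The review does not say how answers are normalized before comparison, so the normalization here (lowercasing, stripping punctuation and English articles, collapsing whitespace) is an assumption borrowed from common VQA practice, and the function names `normalize`, `exact_match`, and `exact_match_accuracy` are hypothetical rather than taken from the paper's code.

```python
import re
import string
from typing import Sequence

# Assumption: answers are compared after a simple text normalization.
# The paper's actual normalization rules are not given in this review.
_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = answer.lower().translate(_PUNCT_TABLE)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> bool:
    """Return True if the prediction equals any reference after normalization."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def exact_match_accuracy(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
) -> float:
    """Fraction of predictions that exactly match one of their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["A dog", "cat"]
    refs = [["dog", "puppy"], ["bird"]]
    print(exact_match_accuracy(preds, refs))
```

Exact match is strict by design: a correct but verbose answer counts as a miss, which is one reason a benchmark like this pairs it with human evaluation for free-form rationales.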

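Experimental Results

Research Questions

RQ1: How do multimodal large models perform on commonsense-knowledge VQA across diverse categories?
RQ2: How well do models handle fine-grained world-knowledge VQA tasks that require entity-specific information?
RQ3: Can GPT-4V and its peers generate reliable decision-making rationales for their VQA answers?
RQ4: How do prompting strategies (zero-shot vs. few-shot) and composite-image prompting affect performance and interpretability?
RQ5: What are the main failure modes of GPT-4V on knowledge-intensive VQA, and how can they be mitigated?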

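Key Findings

GPT-4V achieves state-of-the-art performance on commonsense knowledge, fine-grained world knowledge, and rationale generation tasks.
GPT-4V's reasoning and explanations improve when it is given composite-image few-shot prompts.
GPT-4V hallucinates substantially on world-knowledge questions, which points to a need for better knowledge grounding.
GPT-4V handles composite images well, but it tends to misread visual content or rely too heavily on visual cues, which affects its answers in some categories.
Open-source MLMs lag behind GPT-4V on many knowledge-intensive VQA tasks, and the performance gap varies widely across categories.
Few-shot prompting can improve GPT-4V's performance in some domains, but the benefit varies by model and category.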

This review was generated by AI and checked by a human editor.