QUICK REVIEW

[논문 리뷰] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Hao Shao, Shengju Qian|arXiv (Cornell University)|2024. 03. 25.

Semantic Web and Ontologies인용 수 5

한 줄 요약

Visual CoT는 다중 모달 LLM에서 다회차, 영역 중심 추론을 가능하게 하는 Visual Chain-of-Thought 파이프라인과 373k-item의 visual CoT 데이터셋을 도입하여 해석 가능성과 성능을 향상시킵니다. 또한 벤치마크와 사전 학습된 VisCoT 모델도 제공합니다.

ABSTRACT

Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. However, they often lack interpretability and struggle with complex visual inputs, especially when the resolution of the input image is high or when the interested region that could provide key information for answering the question is small. To address these challenges, we collect and introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Additionally, about 98k pairs of them are annotated with detailed reasoning steps. Importantly, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We also introduce the related benchmark to evaluate the MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available on https://hao-shao.com/projects/viscot.html to support further research in this area.

연구 동기 및 목표

VQA 및 관련 작업에서 중간 시각적 Chain-of-Thought 감독의 부족 해결.
핵심 이미지 영역을 식별하고 집중하여 동적 다회 추론 가능하도록 함.
MLLM를 평가하기 위한 확장 가능한 Visual CoT 데이터셋과 벤치마크를 제공하여 영역별 추론을 평가합니다.

제안 방법

전역 이미지 영역과 지역화된 이미지 영역을 공동으로 사용하는 두 회전 다중 모달 파이프라인(VisCoT)을 제안합니다.
질문에 답하는 데 핵심 영역을 강조 표시하는 바운딩 박스로 주석된 373k QA 쌍의 Visual CoT 데이터셋을 만듭니다.
초기 캡션 기반 사전 학습 후 시각적 CoT 데이터로 미세 조정하는 두 단계 프로세스를 통해 VisCoT를 학습합니다.
지역 영역에 집중하고 영역 간 추론이 필요한 작업에서 MLLMs를 평가하기 위한 Visual CoT 벤치마크를 도입합니다.
시각적 샘플러를 사용하여 지역화된 이미지 영역을 자르고 전역 이미지 특징과 통합하여 MLLM에 제공합니다.

Figure 1 : Examples of the collected data with corresponding question-answer annotations and visual CoT bboxes. The red bboxes in the images highlight the critical image regions that provide necessary and related information for answering the questions.

실험 결과

연구 질문

RQ1시각적 CoT 데이터가 영역 중심 시각 추론 작업에서 MLLMs의 해석 가능성과 정확도를 향상시킬 수 있는가?
RQ2동적 영역 중심 처리 방식이 VQA, 읽기/정보그래픽, 관계 추론 데이터셋에서 성능에 어떤 영향을 미치는가?
RQ3경계 상자 품질 및 샘플링 전략이 Visual CoT의 효과에 미치는 영향은 무엇인가?

주요 결과

VisCoT는 Visual CoT 벤치마크에서 특히 문서/텍스트 및 고해상도 시각 작업에서 성능 향상을 달성합니다.
시각적 CoT를 사용하면 CoT가 없는 기본(Base라인) 대비 SROIE에서 최대 8×의 성능 향상을 얻을 수 있습니다.
시각적 CoT 파이프라인은 이미지 해상도가 낮고 시각적 토큰이 더 적은 경우에도 효과적일 수 있습니다.
GT 바운딩 박스는 상한선을 제공하고 성능을 크게 향상시켜 바운딩 박스 정확도가 중요함을 시사합니다.
병렬적 영역 확장 컨텍스트 자르기 및 지역화된 이미지 이후 원래 질문을 재구성하는 프롬프트가 정확도를 높이는 것으로 나타났습니다.
VisCoT는 여러 벤치마크에서 여러 SOTA 방법과 비교하여 우수한 일반화 성능을 보여 다양한 작업에서의 일반화를 입증합니다.

Figure 2 : Density distribution of the visual CoT dataset in terms of the relative proportion of bbox region to full image size. Different colors denote different source datasets. We find that text-oriented source datasets typically show smaller ratios compared to others.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.