QUICK REVIEW

[Paper Review] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, Xiaodong He|arXiv (Cornell University)|2017. 07. 25.

Multimodal Machine Learning Applications | References: 55 | Citations: 94

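One-Line Summary

The paper combines bottom-up attention (region proposals from Faster R-CNN) with a top-down attention mechanism, so that attention for image captioning and VQA is computed over salient image regions. The method achieves state-of-the-art results on MSCOCO captioning and wins the 2017 VQA Challenge.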

ABSTRACT

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

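Research Motivation and Goals

Motivate attending to images at the level of objects and salient regions rather than over a fixed grid.
Develop a bottom-up attention mechanism that proposes region-based features using Faster R-CNN.
Integrate the bottom-up regions with a top-down attention mechanism to improve performance on captioning and VQA.
Demonstrate that region-based attention yields improvements across standard evaluation metrics.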

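Proposed Method

Define the image features V as the set of region features produced by a bottom-up Faster R-CNN with a ResNet-101 backbone, keeping the regions whose objectness score exceeds a threshold.
Use a top-down attention mechanism to compute attention weights over V, conditioned on the task context (captioning or VQA).
For captioning, use two LSTMs with soft attention over V: one for top-down attention and one for language modeling.
For VQA, implement a joint multimodal embedding that incorporates the attention-weighted image features to predict answers from a fixed vocabulary.
Train with cross-entropy loss, then refine with Self-Critical Sequence Training (SCST) to optimize the CIDEr score.
Optionally compare against a ResNet baseline to quantify the gain from bottom-up attention.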
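
The first two steps above (region selection, then top-down soft attention over V) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the default threshold and region cap, the random weights, and the additive tanh scoring form are all assumptions chosen for illustration. In a real system the region features would come from a trained Faster R-CNN and the context vector h from an LSTM or question encoder.

```python
import numpy as np


def select_regions(features, scores, threshold=0.2, max_regions=100):
    """Keep region features whose detection score exceeds a threshold.

    features: (n, D) array of per-region feature vectors.
    scores:   (n,) array of per-region confidence scores.
    threshold and max_regions are hypothetical defaults for this sketch.
    """
    keep = np.flatnonzero(scores > threshold)
    # Order the surviving regions by descending score, then cap their number.
    keep = keep[np.argsort(-scores[keep])][:max_regions]
    return features[keep]


def top_down_attention(V, h, W_v, W_h, w_a):
    """Soft attention over region features V given a task context vector h.

    Assumed scoring form: a_i = w_a . tanh(W_v v_i + W_h h),
    alpha = softmax(a), attended = sum_i alpha_i v_i.
    Shapes: V (k, D), h (M,), W_v (H, D), W_h (H, M), w_a (H,).
    """
    logits = np.tanh(V @ W_v.T + h @ W_h.T) @ w_a  # one score per region, (k,)
    logits = logits - logits.max()                 # stabilize the softmax
    alpha = np.exp(logits) / np.exp(logits).sum()  # attention weights, sum to 1
    attended = alpha @ V                           # weighted sum of regions, (D,)
    return alpha, attended


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for detector outputs: 50 candidate regions, 16-dim features.
    features = rng.normal(size=(50, 16))
    scores = rng.uniform(size=50)
    V = select_regions(features, scores, threshold=0.5)

    # Stand-ins for the task context and the learned attention parameters.
    h = rng.normal(size=8)
    W_v = rng.normal(size=(12, 16))
    W_h = rng.normal(size=(12, 8))
    w_a = rng.normal(size=12)

    alpha, v_hat = top_down_attention(V, h, W_v, W_h, w_a)
    print(V.shape, alpha.shape, v_hat.shape)
```

The attended vector v_hat has the same dimensionality as a single region feature regardless of how many regions survive the threshold, which is what lets the same attention step feed either a captioning decoder or a VQA answer classifier.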

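Experimental Results

Research Questions

RQ1: How does bottom-up, region-based attention affect image caption quality compared with grid-based attention?
RQ2: Can the same bottom-up attention framework improve Visual Question Answering performance?
RQ3: How does object-level attention affect object identification and the recognition of attributes and relationships in captioning and VQA?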

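Key Results

Bottom-up attention yields substantial gains on MSCOCO across metrics such as CIDEr, SPICE, and BLEU-4, achieving state-of-the-art results.
On the MSCOCO Karpathy test split, the Up-Down model (bottom-up region features with top-down attention) improves on the ResNet baseline by 3–8% across metrics.
For VQA, bottom-up attention took first place in the 2017 VQA Challenge with 70.3% overall accuracy (on the VQA v2.0 test-standard server).
Qualitative attention visualizations show that the model attends to both fine details and large regions, which enables word-level grounding.
Compared with the ResNet baseline, the Up-Down model improved on the Yes/No, Number, and Other question types on the VQA v2.0 validation and test sets.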


This review was generated by AI and reviewed by a human editor.