QUICK REVIEW

[Paper Review] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Zhicheng Huang, Zhaoyang Zeng | arXiv (Cornell University) | 2021. 04. 07.

Multimodal Machine Learning Applications | References: 42 | Citations: 24

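One-Line Summary

SOHO is an end-to-end vision-language pre-training model that learns cross-modal representations from image-text pairs using a trainable visual encoder and a dynamically updated visual dictionary; it speeds up inference and improves on prior state-of-the-art results on several VL tasks.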

ABSTRACT

We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR$^2$ test-P split, 6.7% accuracy on SNLI-VE test split, respectively.

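Motivation and Objectives

Motivate end-to-end vision-language pre-training that does not rely on bounding-box region features.
Bridge the semantic gap between dense visual features and language tokens.
Introduce a visual dictionary that produces compact visual tokens and can be updated dynamically during training.
Develop Masked Visual Modeling (MVM), Masked Language Modeling (MLM), and Image-Text Matching (ITM) as pre-training objectives.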

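Proposed Method

Use a trainable CNN visual encoder to extract features from the whole image.
Introduce a visual dictionary (VD) that maps visual features to k cluster centers and is updated with a moving-average (momentum) rule.
Define a non-differentiable nearest-neighbor mapping onto the VD and apply a stop-gradient update so that training remains end-to-end.
Compose the pre-training objective from three tasks: MLM, MVM, and ITM.
Pre-train on in-domain Visual Genome (VG) and MSCOCO data to learn cross-modal representations.
Fine-tune on downstream tasks including image-text retrieval, VQA, NLVR$^2$, and visual entailment.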
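The visual dictionary steps listed above (nearest-neighbor lookup followed by a moving-average update) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `nearest_index`, `quantize`, and `momentum_update`, the momentum value `gamma`, and the toy 2-D data are all assumptions made for illustration. Plain Python has no automatic differentiation, so the stop-gradient step is not shown; in an autograd framework the looked-up entry would typically be used in the forward pass while gradients flow back to the encoder feature.

```python
import math


def nearest_index(feature, dictionary):
    """Index of the dictionary entry closest to `feature` in L2 distance."""
    return min(range(len(dictionary)),
               key=lambda j: math.dist(feature, dictionary[j]))


def quantize(features, dictionary):
    """Map every feature to its nearest entry (a non-differentiable lookup)."""
    indices = [nearest_index(v, dictionary) for v in features]
    tokens = [list(dictionary[j]) for j in indices]
    return indices, tokens


def momentum_update(features, indices, dictionary, gamma=0.9):
    """Move each selected entry toward the mean of the features assigned to it.

    `gamma` is a hypothetical momentum coefficient, not a value from the paper.
    Entries that no feature selected are left unchanged.
    """
    updated = [list(d) for d in dictionary]
    for j in set(indices):
        assigned = [v for v, h in zip(features, indices) if h == j]
        mean = [sum(col) / len(assigned) for col in zip(*assigned)]
        updated[j] = [gamma * d + (1.0 - gamma) * m
                      for d, m in zip(dictionary[j], mean)]
    return updated


# Toy example: a dictionary with k = 2 entries and three 2-D features.
dictionary = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 1.0], [9.0, 11.0], [3.0, 1.0]]

indices, tokens = quantize(features, dictionary)
new_dictionary = momentum_update(features, indices, dictionary, gamma=0.5)

print(indices)         # [0, 1, 0]
print(new_dictionary)  # [[1.0, 0.5], [9.5, 10.5]]
```

In this toy run the first and third features share entry 0, so that entry moves halfway toward their mean, which is the sense in which similar visual content is pulled onto a consistent token.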

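Experimental Results

Research Questions

RQ1: Can end-to-end VLPT learn effective cross-modal representations from image-text pairs without region-based features?
RQ2: Does a dynamically updated visual dictionary improve cross-modal alignment compared with region-based or grid-based features?
RQ3: What performance and efficiency advantages does SOHO offer on standard VL tasks?
RQ4: How should the pre-training losses (MLM, MVM, ITM) be balanced and combined for the best cross-modal learning?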

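Key Results

SOHO achieves notable improvements on several VL benchmarks, including an absolute R@1 gain of 2.0% on MSCOCO text retrieval (5k test split).
It achieves an accuracy gain of 1.5% on the NLVR$^2$ test-P split.
It achieves an accuracy gain of 6.7% on the SNLI-VE test split.
It achieves a VQA score gain of 0.56% on the VQA2.0 test-std split.
SOHO inference is about 10 times faster than region-based, BUTD-style methods because processing is end-to-end and no region proposals are needed.
A VD size of 2048 often yields the best performance gains across tasks.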
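For readers unfamiliar with the R@1 figure quoted above, the following is a minimal sketch of how a Recall@K retrieval metric is commonly computed. The function name `recall_at_k`, the similarity scores, and the ground-truth sets are hypothetical and are not taken from the paper's evaluation code; the sketch assumes each query comes with a set of correct candidate indices.

```python
def recall_at_k(scores, positives, k=1):
    """Fraction of queries with at least one correct candidate in the top k.

    scores[i][j] is the similarity between query i and candidate j;
    positives[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for row, correct in zip(scores, positives):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy example: three image queries scored against three candidate captions.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
positives = [{0}, {1}, {1, 2}]

r_at_1 = recall_at_k(scores, positives, k=1)  # 2 of 3 queries hit
r_at_2 = recall_at_k(scores, positives, k=2)  # all 3 queries hit
```

An "absolute gain of 2.0% R@1" then means this fraction, expressed as a percentage, is two points higher than that of the compared method.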


This review was generated by AI and reviewed by a human editor.