QUICK REVIEW

[Paper Review] VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su, Xizhou Zhu | arXiv (Cornell University) | 2019-08-22

Multimodal Machine Learning Applications | References: 45 | Citations: 782

One-Line Summary

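VL-BERT introduces a unified visual-linguistic Transformer that is pre-trained on image-caption data and a text-only corpus, then fine-tuned end to end as a single model, achieving state-of-the-art results on VCR, VQA, and referring expression comprehension.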

ABSTRACT

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either of a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit for most of the visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved the first place of single model on the leaderboard of the VCR benchmark. Code is released at https://github.com/jackroos/VL-BERT.

Research Motivation and Goals

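Develop a generic, pre-trainable visual-linguistic representation that can be fine-tuned for many downstream tasks.
Integrate visual RoI features and linguistic inputs in a single Transformer backbone with flexible cross-modal attention.
Pre-train at scale on both visual-linguistic and text-only corpora to align visual and linguistic clues and improve generalization.
Demonstrate state-of-the-art performance on VCR, VQA, and referring expression comprehension with a single model.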

Proposed Method

Extend the Transformer architecture so that word inputs and RoI inputs are processed as one unified sequence.
Represent each input element as the combination of token, visual feature, segment, and position embeddings, including a new visual feature embedding for RoIs.
Pre-train on visual-linguistic data with two tasks: Masked Language Modeling with Visual Clues and Masked RoI Classification with Linguistic Clues.
Pre-train on Conceptual Captions (visual-linguistic) and BooksCorpus/Wikipedia (text-only corpus) with a 1:1 sampling ratio.
Fine-tune end to end for downstream tasks, using task-specific input/output formats (e.g., <Question, Answer, Image>, <Caption, Image>).
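
The embedding scheme listed above can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`toy_embedding`, `add_vectors`, `build_input_sequence`), the toy width `DIM`, the `[IMG]` placeholder token, the mean-of-RoIs stand-in for a whole-image feature, and the shared position index for RoIs are all assumptions made for illustration. The sketch only shows the idea that every element, whether a word or an RoI, is the sum of a token, visual feature, segment, and position embedding.

```python
import random

DIM = 4  # toy width; an assumption for illustration only


def toy_embedding(key, dim=DIM):
    """Hypothetical stand-in for a learned embedding table lookup."""
    rng = random.Random(key)  # a string seed gives a repeatable vector
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def build_input_sequence(words, roi_features):
    """Return one summed embedding per word and per RoI."""
    # Assumption: words are paired with a whole-image feature, approximated
    # here by the mean of the RoI features.
    image_feature = [sum(col) / len(roi_features) for col in zip(*roi_features)]
    sequence = []
    for position, word in enumerate(words):
        sequence.append(add_vectors(
            toy_embedding("token:" + word),
            image_feature,
            toy_embedding("segment:text"),
            toy_embedding("position:%d" % position),
        ))
    # Assumption: RoIs have no natural order, so they share one position index.
    roi_position = len(words)
    for feature in roi_features:
        sequence.append(add_vectors(
            toy_embedding("token:[IMG]"),
            feature,
            toy_embedding("segment:image"),
            toy_embedding("position:%d" % roi_position),
        ))
    return sequence


words = ["a", "dog", "runs"]
rois = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
sequence = build_input_sequence(words, rois)
print(len(sequence), len(sequence[0]))  # 5 4
```

In a real model the lookups would be learned tables and the RoI features would come from an object detector; the summed vectors would then feed the Transformer layers.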

Experimental Results

Research Questions

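RQ1: Can a single, unified Transformer-based model effectively learn and align visual and linguistic representations across multiple tasks?
RQ2: Does joint pre-training on visual-linguistic data and text-only data improve performance on downstream visual-linguistic tasks compared with single-domain pre-training?
RQ3: How do visual clues in masked language modeling and masked RoI classification affect downstream tasks such as VCR, VQA, and RefCOCO+?
RQ4: Can the pre-trained VL-BERT model achieve state-of-the-art results on diverse benchmarks with a single model architecture?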

Key Results

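VL-BERT achieves strong performance on multiple visual-linguistic tasks with a single unified model.
Pre-training on visual-linguistic data improves the final VCR task (Q→AR) by about 1.0 point over the baseline without pre-training.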
VL-BERT LARGE achieves competitive results: VCR val Q→A 75.5, QA→R 77.9; VCR test Q→A 75.8, QA→R 78.4; RefCOCO+ (ground-truth regions) val 80.31, testA 83.62, testB 75.45; VQA test-dev 71.79, test-std 72.22.
On VQA, VL-BERT BASE and LARGE outperform the baselines without pre-training and surpass some concurrent methods in the single-model setting (e.g., LARGE reaches 71.79 on test-dev and 72.22 on test-std).
On RefCOCO+, VL-BERT LARGE performs strongly (testA 83.62 with ground-truth regions; testB 62.30 with detected regions).
At the time of publication, VL-BERT showed state-of-the-art performance on Visual Commonsense Reasoning (VCR) among single-model approaches.
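
The VCR figures above use three metrics: Q→A (picking the answer), QA→R (picking the rationale given the correct answer), and Q→AR, which counts an example as correct only when both choices are right. The following is a minimal sketch of how these accuracies relate, assuming per-example correctness flags; the function name `vcr_accuracies` and the sample data are hypothetical and are not taken from the paper or its code release.

```python
def vcr_accuracies(predictions):
    """Compute (Q->A, QA->R, Q->AR) accuracy.

    predictions: list of (answer_correct, rationale_correct) booleans,
    one pair per example.
    """
    total = len(predictions)
    q_a = sum(a for a, _ in predictions) / total
    qa_r = sum(r for _, r in predictions) / total
    # Q->AR is the joint metric: both choices must be right.
    q_ar = sum(a and r for a, r in predictions) / total
    return q_a, qa_r, q_ar


# Hypothetical results for four examples.
sample = [(True, True), (True, False), (False, True), (True, True)]
q_a, qa_r, q_ar = vcr_accuracies(sample)
print(q_a, qa_r, q_ar)  # 0.75 0.75 0.5
```

Because Q→AR requires both choices to be right, it can never exceed the lower of the other two scores, which is why it is the hardest of the three metrics to improve.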


This review was generated by AI and reviewed by a human editor.