QUICK REVIEW

[Paper Review] LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Hao Tan, Mohit Bansal | arXiv (Cornell University) | 2019-08-20

Multimodal Machine Learning Applications | References: 44 | Citations: 223

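One-Line Summary

LXMERT introduces a three-encoder Transformer model for learning vision-and-language representations, pre-trains it on five multimodal tasks, and achieves state-of-the-art results on VQA and GQA along with a notable gain on NLVR2.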

ABSTRACT

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves the state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel model components and pre-training strategies significantly contribute to our strong results; and also present several attention visualizations for the different encoders. Code and pre-trained models publicly available at: https://github.com/airsplay/lxmert
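The masked pre-training tasks named in the abstract (masked language modeling and masked object prediction) can be illustrated with a minimal sketch of how masked inputs and their targets might be prepared. This is not the authors' implementation: the `[MASK]` placeholder, the choice to zero out masked object features, the 0.15 masking rate, and the helper names `mask_words` and `mask_objects` are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder token, not specified in this review


def mask_words(tokens, mask_prob, rng):
    """Replace a random subset of tokens with MASK_TOKEN.

    Returns the masked sequence and a dict {position: original token}
    that a masked-language-modeling loss would try to recover.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def mask_objects(features, labels, mask_prob, rng):
    """Zero out a random subset of object feature vectors (assumed scheme).

    Returns the masked features and a dict {position: (feature, label)}:
    the feature is the regression target, the label the classification target.
    """
    masked, targets = [], {}
    for i, (feat, label) in enumerate(zip(features, labels)):
        if rng.random() < mask_prob:
            masked.append([0.0] * len(feat))
            targets[i] = (list(feat), label)
        else:
            masked.append(list(feat))
    return masked, targets


# Toy example with hypothetical data; 0.15 is an assumed masking rate.
rng = random.Random(0)
tokens = "what is the man holding in his hand".split()
features = [[float(i), float(i) + 0.5] for i in range(4)]
labels = ["man", "hand", "cup", "table"]

masked_tokens, word_targets = mask_words(tokens, 0.15, rng)
masked_feats, object_targets = mask_objects(features, labels, 0.15, rng)
print(masked_tokens, word_targets)
print(len(masked_feats), sorted(object_targets))
```

Because both modalities are fed to the model together, a masked word can in principle be recovered from the visible objects and a masked object from the visible words, which is what makes these objectives cross-modal rather than purely single-modality.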

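Motivation and Objectives

Vision-and-language reasoning requires modeling visual concepts, language semantics, and the cross-modal alignment between them.
Propose a cross-modality Transformer architecture with dedicated encoders for language, object relationships, and joint reasoning.
Pre-train on a large-scale image-sentence corpus using diverse multimodal tasks to capture intra-modality and cross-modality dependencies.
Show state-of-the-art performance on VQA and GQA, and demonstrate generalization to NLVR2 through fine-tuning and ablation studies.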

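Proposed Method

Three encoders: a language encoder, an object-relationship encoder, and a cross-modality encoder composed of self-attention and cross-attention layers.
Input embeddings combine position-aware object RoI embeddings from an object detector with word-level sentence representations.
Five pre-training tasks: (i) masked cross-modality language modeling, (ii) RoI-feature regression for masked objects, (iii) detected-label classification for masked objects, (iv) cross-modality matching, and (v) image question answering (QA).
Cross-modal attention enables bidirectional information exchange between language and vision, with a dedicated cross-modality encoder built from multiple stacked layers.
For efficiency, a fixed number of 36 objects is kept per image, and the model is trained on a large mixture of image-sentence data (9.18M pairs, 100M words, 6.5M objects).
Fine-tune from the pre-trained weights on the VQA, GQA, and NLVR2 datasets to evaluate generalization and task adaptation.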
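The bidirectional exchange performed by cross-modal attention can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the LXMERT implementation: it uses a single attention head with no learned projection matrices, and it omits the self-attention sub-layers, residual connections, and layer normalization that a full cross-modality layer would contain. The function names and toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention without learned projections.

    Each query vector attends over all context vectors and is replaced by
    a weighted average of them.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def cross_modality_exchange(lang_feats, vis_feats):
    """Words attend to objects and objects attend to words.

    Both directions read the same, not-yet-updated inputs.
    """
    new_lang = cross_attention(lang_feats, vis_feats)
    new_vis = cross_attention(vis_feats, lang_feats)
    return new_lang, new_vis


# Toy inputs: 3 word vectors and 36 object vectors, all of dimension 4.
lang = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
vis = [[0.05 * (i - j) for j in range(4)] for i in range(36)]
new_lang, new_vis = cross_modality_exchange(lang, vis)
print(len(new_lang), len(new_lang[0]))  # 3 4
print(len(new_vis), len(new_vis[0]))    # 36 4
```

Keeping exactly 36 objects per image means the visual side always has the same sequence length, so a batch needs no padding on that side.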

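Experimental Results

Research Questions

RQ1: How can a Transformer-based architecture be designed to jointly model vision and language with explicit cross-modal interaction?
RQ2: Which pre-training objectives best capture intra-modality and cross-modality relationships for vision-and-language tasks?
RQ3: How much can a cross-modality pre-trained model improve VQA, GQA, and NLVR2 performance compared with single-modality or language-centric pre-training?
RQ4: How does ablating model components and pre-training tasks affect downstream vision-and-language reasoning performance?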

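Key Findings

LXMERT achieves state-of-the-art results on VQA and GQA on standard metrics.
On NLVR2, fine-tuning yields a large absolute gain of 22 percentage points (from 54% to 76% accuracy).
Ablation studies show that the new model components (the object-relationship encoder and the cross-modality encoder) and the diverse pre-training tasks both contribute substantially to the performance gains.
Cross-modality pre-training without the image QA task performs worse, which underscores the benefit of image question answering data for vision-and-language representations.
Attention visualizations for the language, object-relationship, and cross-modality encoders show how the model connects text with visual elements.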


This review was generated by AI and reviewed by a human editor.