QUICK REVIEW

[Paper Review] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Junnan Li, Ramprasaath R. Selvaraju | arXiv (Cornell University) | 2021. 07. 16.

Multimodal Machine Learning Applications | References: 60 | Citations: 822

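One-Line Summary

ALBEF introduces image-text contrastive alignment before fusion and uses momentum distillation to exploit noisy web data, achieving state-of-the-art results on multiple vision-language tasks without bounding boxes. It jointly learns unimodal and multimodal representations, and uses a momentum teacher to generate pseudo-targets for self-training.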

ABSTRACT

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. In order to improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.

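Research Motivation and Objectives

Motivate and develop a detector-free vision-language pre-training framework that aligns image and text representations before fusing them.
Propose an intermediate image-text contrastive (ITC) loss that aligns the representations of the unimodal encoders and eases cross-modal learning.
Introduce Momentum Distillation (MoD), which improves learning from noisy web data by using a momentum-averaged teacher to generate pseudo-targets.
Show that ALBEF learns robust vision-language representations that perform well on retrieval, VQA, NLVR2, visual entailment (VE), and weakly-supervised grounding.
Provide a theoretical framing based on mutual information maximization to justify the design choices.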
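
Proposed Method

Encode images with a detector-free ViT-based image encoder and text with a transformer-based text encoder.
Fuse image and text through a 6-layer multimodal transformer with cross-modal attention in every layer.
Apply the image-text contrastive (ITC) loss to the unimodal representations to align image and text before fusion.
Use in-batch hard negative mining to strengthen image-text matching (ITM) in the multimodal encoder.
Train with masked language modeling (MLM) and ITM losses to learn multimodal interactions, optimizing the joint objective L = L_itc + L_mlm + L_itm.
Introduce Momentum Distillation (MoD): maintain a momentum model that generates pseudo-targets for the ITC and MLM losses, and mix them with the original losses using a weight α (0.4), which improves both pre-training and downstream performance.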
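
The ITC loss, the α-weighted mixing with momentum pseudo-targets, and in-batch hard negative sampling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, negatives come only from the current batch, the temperature (0.07) and momentum coefficient (0.995) are assumed values that the review does not state, and the loss is written as a cross-entropy against mixed targets, which differs from a KL-divergence formulation only by a term that does not depend on the student model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def itc_loss_with_mod(img, txt, img_m, txt_m, temp=0.07, alpha=0.4):
    """ITC loss with momentum-distilled soft targets (illustrative sketch).

    img, txt:     (B, D) embeddings from the model being trained.
    img_m, txt_m: (B, D) embeddings from the momentum (teacher) model.
    temp is an assumed fixed temperature; alpha=0.4 is the weight from the review.
    """
    img, txt, img_m, txt_m = map(l2_normalize, (img, txt, img_m, txt_m))
    onehot = np.eye(img.shape[0])  # matched pairs sit on the diagonal
    # Student similarity logits in both directions.
    sim_i2t = img @ txt.T / temp
    sim_t2i = txt @ img.T / temp
    # Teacher similarity distributions serve as pseudo-targets.
    q_i2t = softmax(img_m @ txt_m.T / temp)
    q_t2i = softmax(txt_m @ img_m.T / temp)
    # Mix pseudo-targets with the one-hot ground truth using weight alpha.
    tgt_i2t = alpha * q_i2t + (1 - alpha) * onehot
    tgt_t2i = alpha * q_t2i + (1 - alpha) * onehot
    loss_i2t = -(tgt_i2t * log_softmax(sim_i2t)).sum(axis=1).mean()
    loss_t2i = -(tgt_t2i * log_softmax(sim_t2i)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

def ema_update(student, teacher, m=0.995):
    """Exponential moving average of parameters; m is an assumed coefficient."""
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def sample_hard_negatives(sim, rng):
    """For each row, draw one non-matching index with probability
    proportional to its softmax similarity (requires batch size >= 2)."""
    weights = softmax(sim, axis=1) + 1e-8  # epsilon guards against underflow
    np.fill_diagonal(weights, 0.0)         # never pick the positive pair
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(w), p=w) for w in weights])
```

Setting alpha to 0 recovers the plain contrastive loss with one-hot targets, which makes the effect of the pseudo-targets easy to isolate in an ablation.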
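
Research Questions

Can a detector-free VLP approach with an intermediate ITC loss before fusion improve the learning of cross-modal interactions?
Does momentum distillation enable effective learning from noisy, web-scale vision-language data without explicit bounding boxes?
From a mutual information maximization perspective, how do ITC, MLM, ITM, and MoD interact to improve vision-language representations?
What performance gains do ALBEF variants achieve over state-of-the-art methods on image-text retrieval, VQA, NLVR2, visual entailment, and weakly-supervised grounding?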

Experimental Results

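Key Results

ALBEF achieves state-of-the-art performance on image-text retrieval, outperforming methods trained on much larger datasets.
Even with detector-free inputs, ALBEF achieves competitive or superior results on VQA, NLVR2, and VE, with faster inference than detector-based methods.
Momentum Distillation (MoD) improves both pre-training and downstream tasks, enabling learning from larger and noisier web data.
ALBEF, which combines ITC, MLM, ITM, and MoD, shows substantial gains on multiple tasks over baselines such as the MLM+ITM and hard-negative ITM variants.
The mutual information perspective explains the ALBEF components as different ways of generating views that maximize the MI between image and text representations.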
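
For readers unfamiliar with the mutual information view, the standard InfoNCE bound is sketched below. The notation is assumed rather than taken from this review: a and b denote two views of a data point (for ITC, an image and its paired text), s is a scoring function, and \hat{B} is a candidate set containing the positive b plus negatives, with N = |\hat{B}|.

```latex
% Standard InfoNCE objective and the mutual information lower bound it implies.
% Symbols a, b, s, \hat{B}, and N are assumed notation, not taken from the review.
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}_{p(a,b)}\!\left[
      \log \frac{\exp\big(s(a,b)\big)}
                {\sum_{\hat{b}\in\hat{B}} \exp\big(s(a,\hat{b})\big)}
    \right],
\qquad
I(A;B) \;\ge\; \log N - \mathcal{L}_{\mathrm{NCE}}
```

Minimizing this loss therefore raises a lower bound on the mutual information between the two views, which is the sense in which a contrastive objective such as ITC can be read as MI maximization.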

This review was generated by AI and reviewed by a human editor.