QUICK REVIEW

[Paper Review] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Zhicheng Huang, Zhaoyang Zeng | arXiv (Cornell University) | 2020-04-02

Multimodal Machine Learning Applications | References: 40 | Citations: 286

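One-line summary

Pixel-BERT aligns image pixels with text in an end-to-end Transformer framework to learn universal vision-language embeddings. It is pre-trained on image-sentence pairs without region-based features and achieves state-of-the-art performance on VQA, NLVR2, and image-text retrieval.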

ABSTRACT

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs instead of using region-based image features as the most recent vision and language tasks. Our Pixel-BERT which aligns semantic connection in pixel and text level solves the limitation of task-specific visual representation for vision and language tasks. It also relieves the cost of bounding box annotations and overcomes the unbalance between semantic labels in visual task and language semantic. To provide a better representation for down-stream tasks, we pre-train a universal end-to-end model with image and sentence pairs from Visual Genome dataset and MS-COCO dataset. We propose to use a random pixel sampling mechanism to enhance the robustness of visual representation and to apply the Masked Language Model and Image-Text Matching as pre-training tasks. Extensive experiments on downstream tasks with our pre-trained model show that our approach makes the most state-of-the-arts in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, Natural Language for Visual Reasoning for Real (NLVR). Particularly, we boost the performance of a single model in VQA task by 2.17 points compared with SOTA under fair comparison.

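Motivation and objectives

Motivates aligning visual information with language semantics directly at the pixel level rather than through region-based features.
Proposes Pixel-BERT, an end-to-end model that combines a CNN visual encoder with a multi-modal Transformer.
Pre-trains on large-scale image-sentence datasets (Visual Genome and MS-COCO, per the abstract) with MLM and ITM, and uses a random pixel sampling mechanism to make the visual representation more robust.
Demonstrates performance gains over previous region-based approaches on VQA, NLVR2, and image-text retrieval.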

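Proposed method

Uses a fully convolutional neural network (CNN) backbone to encode image pixels into visual embeddings.
Embeds language with BERT-style word-level embeddings plus position and semantic embeddings.
Feeds the visual and language embeddings into a single Transformer to learn cross-modal interactions.
Learns alignment through Masked Language Modeling (MLM) on text conditioned on the visual input, and through Image-Text Matching (ITM).
Introduces a random pixel sampling mechanism during pre-training to improve robustness and reduce overfitting.
Fine-tunes for downstream tasks by feeding the [CLS] token representation to a task-specific classifier.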
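
The input-preparation steps above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the nested-list representation of the CNN feature map, and the default values are all assumptions made for the example, and the masking here simply substitutes [MASK] instead of following BERT's full replacement scheme.

```python
import random


def sample_pixel_features(feature_map, num_samples, rng):
    """Flatten an H x W grid of feature vectors and keep a random subset.

    feature_map: list of rows; each row is a list of feature vectors.
    Returns (flat_index, feature_vector) pairs, kept in spatial order.
    (Hypothetical helper; the sample size is an illustrative parameter.)
    """
    flat = [vec for row in feature_map for vec in row]
    k = min(num_samples, len(flat))
    kept = sorted(rng.sample(range(len(flat)), k))
    return [(i, flat[i]) for i in kept]


def mask_tokens(tokens, mask_prob, rng, mask_token="[MASK]"):
    """MLM-style masking: hide each token independently with mask_prob.

    Returns the masked token list and a dict {position: original token}
    that a model would be trained to predict.
    """
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[pos] = tok
        else:
            masked.append(tok)
    return masked, targets


def build_joint_sequence(tokens, pixel_features):
    """Concatenate [CLS], text tokens, [SEP], and sampled pixel features
    into the single sequence that a joint Transformer would consume."""
    return ["[CLS]"] + list(tokens) + ["[SEP]"] + [vec for _, vec in pixel_features]


if __name__ == "__main__":
    rng = random.Random(0)
    # A toy 3 x 4 "feature map" whose vectors just encode their coordinates.
    fmap = [[[float(r), float(c)] for c in range(4)] for r in range(3)]
    pixels = sample_pixel_features(fmap, num_samples=5, rng=rng)
    masked, targets = mask_tokens(["a", "dog", "on", "grass"], 0.15, rng)
    sequence = build_joint_sequence(masked, pixels)
    print(len(sequence), targets)
```

Sampling only a subset of pixel features on each step means the Transformer sees a different, shorter visual sequence every time, which is the intuition behind the robustness and overfitting claims in the list above. In a real model the sampled vectors and token embeddings would be tensors, and sampling would be applied only during pre-training.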

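Experimental results

Research questions

RQ1: Can pixel-level visual representations, learned jointly with text, improve cross-modal understanding beyond what region-based features provide?
RQ2: Do the MLM and ITM pre-training tasks on pixel-level inputs further improve vision-language alignment and downstream task performance?
RQ3: How does pixel-level cross-modal attention affect VQA, NLVR2, and image-text retrieval compared with region-based methods?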

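Key results

With a ResNeXt-152 backbone, Pixel-BERT achieves 74.55 on VQA test-std, outperforming many previous methods.
Pixel-BERT (x152) also reaches 74.45 on test-dev and exceeds the VQA state of the art under fair comparison.
On NLVR2, Pixel-BERT achieves 77.2 on test-P and 76.5 on dev, outperforming several pair-based baselines.
On image-text retrieval, Pixel-BERT shows substantial gains over Unicoder-VL and UNITER, with improved recall metrics on the MS-COCO and Flickr30K datasets.
Ablation studies show that MLM and ITM significantly improve downstream tasks, and that random pixel sampling provides additional gains, particularly on retrieval.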
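
For readers unfamiliar with the retrieval metric, the sketch below shows how a recall figure of this kind is commonly computed, assuming the usual Recall@K definition. The function name and the toy similarity matrix are hypothetical, and the example simplifies to one correct candidate per query, whereas captioning datasets typically pair each image with several sentences.

```python
def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[q][c]: model score for query q against candidate c.
    gt_index[q]: index of the single correct candidate for query q
    (a simplifying assumption made for this illustration).
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if gt_index[q] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Toy scores: 3 text queries against 3 candidate images.
    sim = [
        [0.9, 0.1, 0.2],
        [0.3, 0.2, 0.8],
        [0.1, 0.7, 0.6],
    ]
    truth = [0, 1, 2]
    for k in (1, 2, 3):
        print(f"R@{k} = {recall_at_k(sim, truth, k):.3f}")
```

The same function covers both retrieval directions: pass text-to-image scores for image retrieval, or the transposed matrix for text retrieval.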

This review was generated by AI and reviewed by a human editor.