QUICK REVIEW

[Paper Review] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia, Yinfei Yang|arXiv (Cornell University)|2021. 02. 11.

Multimodal Machine Learning Applications | References: 75 | Citations: 1,195

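One-Line Summary

ALIGN learns visual and vision-language embeddings from a large-scale, noisy corpus of image alt-text pairs using a dual-encoder and a contrastive loss, and achieves state-of-the-art zero-shot and fine-tuned performance on vision and cross-modal retrieval tasks.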

ABSTRACT

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.

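Motivation and Objectives

Motivate scalable visual and vision-language representation learning that does not depend on costly data curation.
Propose a simple dual-encoder architecture trained with a contrastive loss on large-scale, noisy image alt-text pairs.
Demonstrate that scale can compensate for noise, yielding strong transfer performance across vision and cross-modal tasks.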

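Proposed Method

Uses EfficientNet as the image encoder and BERT as the text encoder, with both mapped into a shared embedding space.
Trains with a normalized softmax contrastive loss in both the image-to-text and text-to-image directions.
Builds a dataset of 1.8B image alt-text pairs by following the Conceptual Captions collection procedure while applying only minimal frequency-based filtering and no heavy post-processing.
Evaluates zero-shot and fine-tuned retrieval on Flickr30K and MSCOCO, as well as cross-modal benchmarks such as Crisscrossed Captions (CxC).
Demonstrates zero-shot ImageNet classification by feeding class-name prompts to the text encoder.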
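The training objective and the zero-shot classification step listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`contrastive_loss`, `zero_shot_classify`) are hypothetical, the embeddings are assumed to be non-zero vectors already produced by the two encoders, and the temperature is fixed at an arbitrary 0.1 for simplicity.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Sum of image-to-text and text-to-image softmax losses.

    Pair i (image i, text i) is the positive; every other text or image
    in the batch serves as a negative. The temperature value here is an
    assumption made for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = cosine similarity of image i and text j, scaled by temperature
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
           for i in range(n)]
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the text-to-image direction
    t2i = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return i2t + t2i


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt embedding closest to the image."""
    img = l2_normalize(image_emb)
    scores = [dot(img, l2_normalize(t)) for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy batch: two image embeddings and their matching text embeddings.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
predicted = zero_shot_classify([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy batch, the correctly paired embeddings give a much lower loss than the swapped ones, which is the behavior the objective rewards. Zero-shot classification reuses the same similarity: each class name is embedded by the text encoder, and the image is assigned to the nearest class embedding.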

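Experimental Results

Research Questions

RQ1: Can a simple dual-encoder trained on a very large, noisy image-text dataset reach state-of-the-art cross-modal retrieval performance without heavy filtering?
RQ2: What trade-off exists between scale and data quality in visual and vision-language representation learning?
RQ3: What transfer performance can be achieved on image classification and image-text retrieval in zero-shot and fine-tuned settings?
RQ4: Does a multilingual extension allow cross-modal retrieval to generalize to non-English data?
RQ5: What are the qualitative properties of the learned embeddings (compositionality, support for text + image queries)?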

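Key Results

ALIGN achieves state-of-the-art image-text retrieval performance on Flickr30K and MSCOCO in both zero-shot and fine-tuned settings.
On zero-shot ImageNet classification with class-name prompts, ALIGN reaches 76.4% top-1 accuracy, comparable to CLIP.
On ImageNet classification, the image encoder alone reaches 88.64% top-1 accuracy.
On CxC retrieval and the SITS metric, ALIGN shows substantial gains over prior VSE and cross-attention models, with especially large improvements in image-to-text and text-to-image recall.
The multilingual model (ALIGN_mling), trained on more than 100 languages, outperforms some baselines on zero-shot multilingual image-text retrieval on Multi30K, demonstrating cross-lingual generalization.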
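The retrieval results above are reported as recall, typically Recall@K. As a rough sketch of how such a number is computed from a similarity matrix, the hypothetical helper below assumes exactly one correct candidate per query, located at the same index as the query. The real benchmarks pair each image with several captions, so this is a simplification and not the official evaluation protocol.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption for illustration: the correct candidate for query i is
    candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three queries against three candidates.
scores = [
    [0.9, 0.1, 0.2],  # query 0: correct candidate ranked first
    [0.8, 0.3, 0.1],  # query 1: correct candidate ranked second
    [0.1, 0.2, 0.7],  # query 2: correct candidate ranked first
]
r_at_1 = recall_at_k(scores, 1)
r_at_2 = recall_at_k(scores, 2)
```

For image-to-text retrieval the rows would be image queries and the columns captions; transposing the matrix gives the text-to-image direction.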


This review was generated by AI and reviewed by a human editor.