QUICK REVIEW

[Paper Review] Multimodal Masked Autoencoders Learn Transferable Representations

Xinyang Geng, Hao Liu | arXiv (Cornell University) | 2022. 05. 27.

Multimodal Machine Learning Applications | Citations: 29

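One-Line Summary

M3AE learns a unified vision-language representation through masked token reconstruction, without modality-specific encoders or contrastive learning, and the learned representation transfers to downstream tasks such as ImageNet linear classification and OOD detection.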

ABSTRACT

Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%), in contrast to BERT whose standard masking ratio is 15%, due to the joint training of two data modalities. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language. Lastly, we demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data as well as unpaired data.

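Research Motivation and Goals

Investigate whether a large multimodal model trained purely via masked token prediction can learn transferable vision-language representations.
Develop a simple, scalable architecture that uses a single unified encoder for both modalities instead of modality-specific encoders.
Evaluate how multimodal pretraining on large-scale image-text data affects performance on downstream tasks such as image classification and OOD detection.
Evaluate the model's ability to use both paired and unpaired data within a single training framework.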

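Proposed Method

Treat each image-text pair as one long sequence of image patches and text tokens.
Mask a high fraction of the image patches and text tokens, and reconstruct the missing parts with a unified transformer encoder-decoder.
Use modality-specific embeddings and a shared CLS token to map both modalities into a common representation space.
Apply the reconstruction objective only to masked elements: MSE for masked image patches and cross-entropy loss for masked text tokens.
Allow training on a mixture of paired and unpaired data, which keeps data usage flexible without a contrastive loss.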
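The steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`patchify`, `build_sequence`, `random_mask`, `masked_mse`, `masked_cross_entropy`), the 0.75 mask ratio, the toy sizes, the random stand-ins for decoder outputs, and the unweighted sum of the two losses are all assumptions chosen for illustration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def build_sequence(patches, token_ids, w_img, tok_table, type_emb, cls):
    """Concatenate [CLS], projected patches, and embedded tokens.

    Each modality gets its own type embedding (hypothetical layout:
    row 0 for image, row 1 for text) before joining the shared sequence.
    """
    img = patches @ w_img + type_emb[0]
    txt = tok_table[token_ids] + type_emb[1]
    return np.concatenate([cls[None, :], img, txt], axis=0)

def random_mask(n, ratio, rng):
    """Boolean mask of length n with round(n * ratio) masked (True) entries."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:int(round(n * ratio))]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over masked image patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def masked_cross_entropy(logits, token_ids, mask):
    """Cross-entropy computed over masked text positions only."""
    z = logits[mask]
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(z.shape[0]), token_ids[mask]]
    return float(-picked.mean())

rng = np.random.default_rng(0)
vocab, dim = 100, 16                      # toy sizes (assumed)
image = rng.random((32, 32, 3))
tokens = rng.integers(0, vocab, size=8)

patches = patchify(image, patch=8)        # 16 patches of 8*8*3 values
seq = build_sequence(
    patches, tokens,
    w_img=rng.normal(size=(patches.shape[1], dim)),
    tok_table=rng.normal(size=(vocab, dim)),
    type_emb=rng.normal(size=(2, dim)),
    cls=np.zeros(dim),
)

img_mask = random_mask(len(patches), 0.75, rng)  # ratio is an assumption
txt_mask = random_mask(len(tokens), 0.75, rng)

# Random stand-ins for what a transformer decoder would predict.
patch_pred = rng.random(patches.shape)
text_logits = rng.normal(size=(len(tokens), vocab))

# Unweighted sum of the two masked losses (the weighting is an assumption).
loss = (masked_mse(patch_pred, patches, img_mask)
        + masked_cross_entropy(text_logits, tokens, txt_mask))
```

Because both losses index with the mask before averaging, unmasked patches and tokens contribute nothing to the objective, which matches the "masked elements only" rule in the list.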

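Experimental Results

Research Questions

RQ1: Can M3AE learn general representations that transfer to downstream tasks such as ImageNet classification and OOD detection?
RQ2: Do the learned representations reflect semantic information from both the image and language modalities?
RQ3: How do model size, training time, and masking strategy affect performance and transferability?
RQ4: Can M3AE effectively use paired image-text data and unpaired data under a single training objective?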

Key Results

Model	MAE	M3AE	CLIP	Supervised
Accuracy (%)	44.6	61.3	69.0	81.8

M3AE text ratio	10%	20%	30%	100%
Accuracy (%)	53.3	54.0	54.5	58.8

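In the reported setting, M3AE clearly outperforms MAE on ImageNet linear classification (61.3 vs. 44.6).
By training on a mixture of paired and unpaired data, M3AE transfers well even when only part of the data is paired.
A higher text mask ratio (roughly 50-75% or above) benefits M3AE, in contrast to the conventional BERT-style setting.
M3AE consistently outperforms MAE across the ViT-S/16, ViT-B/16, and ViT-L/16 variants, including with larger models and longer training.
Qualitative analysis shows that attention tends to align with relevant image regions and the corresponding text tokens, suggesting joint vision-language understanding.
M3AE shows robustness in out-of-distribution detection and reconstruction quality on CC12M and ImageNet.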
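For readers unfamiliar with linear classification as a transfer metric, the sketch below shows the general idea: the pretrained encoder stays frozen and only a linear classifier is fit on its features. This is an illustration under stated assumptions, not the paper's evaluation protocol. The features are synthetic Gaussian clusters standing in for encoder outputs, and a closed-form ridge regression onto one-hot targets is swapped in for the gradient-trained classifier that is more typical in practice; `fit_linear_probe`, `predict`, and `sample` are hypothetical names.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a linear classifier on frozen features.

    Uses ridge regression onto one-hot targets (a swapped-in, closed-form
    stand-in for a softmax classifier trained with gradient descent).
    """
    x = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    y = np.eye(num_classes)[labels]                         # one-hot targets
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def predict(weights, features):
    """Predict the class with the highest linear score."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return (x @ weights).argmax(axis=1)

rng = np.random.default_rng(0)
num_classes, dim = 3, 16
centers = rng.normal(scale=5.0, size=(num_classes, dim))

def sample(n):
    """Draw synthetic 'frozen encoder features' clustered around class centers."""
    labels = rng.integers(0, num_classes, size=n)
    return centers[labels] + rng.normal(size=(n, dim)), labels

train_x, train_y = sample(300)
test_x, test_y = sample(100)

weights = fit_linear_probe(train_x, train_y, num_classes)
accuracy = float((predict(weights, test_x) == test_y).mean())
```

Since only the linear layer is fit, the resulting accuracy reflects how linearly separable the frozen representation is, which is why it is used to compare pretraining methods such as MAE and M3AE.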

This review was generated by AI and reviewed by a human editor.