QUICK REVIEW

[Paper Review] 4M: Massively Multimodal Masked Modeling

David Mizrahi, Roman Bachmann | arXiv (Cornell University) | 2023-12-11

Multimodal Machine Learning Applications | Citations: 9

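One-Line Summary

4M trains a single unified Transformer with a multimodal masked modeling objective across many modalities (text, images, geometric and semantic modalities, and neural network feature maps), yielding models that perform vision tasks out of the box, transfer strongly to unseen downstream tasks and new input modalities, and support flexible multimodal generation and editing.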

ABSTRACT

Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities - including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens. 4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box, (2) they excel when fine-tuned for unseen downstream tasks or new input modalities, and (3) they can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility. Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.

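Motivation and Objectives

Motivates the need for versatile, scalable foundation models for vision that handle a wide range of modalities and tasks.
Proposes a single Transformer encoder-decoder trained with a multimodal masked modeling objective across diverse input/output modalities.
Shows that a single model can perform core vision tasks, transfer to unseen downstream tasks and new input modalities, and support multimodal conditional generation and editing.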

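Proposed Method

Unifies diverse modalities by using modality-specific tokenizers to map each modality into a sequence or set of discrete tokens.
Uses a single Transformer encoder-decoder that maps between arbitrary pairs of modalities, relying on cross-attention and modality-specific decoder masks.
Trains with a multimodal masked modeling objective that randomly samples a subset of input tokens and a subset of target tokens across all modalities, enabling scalable cross-modal predictive coding.
Pre-trains on a pseudo-labeled multimodal dataset derived from CC12M, in which pseudo-labeling networks provide aligned data for modalities that lack explicit annotations.
Demonstrates generation through iterative token decoding, producing and editing multiple modalities conditioned on arbitrary modalities.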
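
The token-subset sampling behind the masked modeling objective can be illustrated with a small sketch. This is not the authors' implementation: the function name `sample_input_target`, the modality names, the token ids, and the fixed budgets are all hypothetical, and uniform sampling over one shared pool is a simplifying assumption (the paper's actual sampling distribution may differ). It only shows the idea that tokens from every modality share one discrete space, a small random subset becomes the visible input, and a disjoint random subset becomes the prediction target.

```python
import random


def sample_input_target(tokens_by_modality, num_input, num_target, seed=None):
    """Randomly split tokens from all modalities into disjoint input/target subsets.

    tokens_by_modality: dict mapping a modality name to its list of discrete token ids.
    Returns two lists of (modality, position, token_id) triples.
    Assumption: uniform sampling over the pooled tokens, for illustration only.
    """
    rng = random.Random(seed)
    # Flatten every modality into one pool, remembering where each token came from.
    pool = [
        (modality, position, token_id)
        for modality, tokens in tokens_by_modality.items()
        for position, token_id in enumerate(tokens)
    ]
    if num_input + num_target > len(pool):
        raise ValueError("token budgets exceed the number of available tokens")
    # One shuffle, then two disjoint slices: visible inputs and targets to predict.
    rng.shuffle(pool)
    inputs = pool[:num_input]
    targets = pool[num_input:num_input + num_target]
    return inputs, targets


# Hypothetical tokenized sample: the ids are made up, not real tokenizer output.
example = {
    "rgb": [17, 4, 250, 9, 88, 31],
    "depth": [5, 5, 120, 64],
    "caption": [101, 2023, 3899],
}
inputs, targets = sample_input_target(example, num_input=4, num_target=3, seed=0)
```

In this sketch the encoder would see only `inputs` and the decoder would be trained to predict `targets`. Because both budgets stay fixed regardless of how many modalities are present, the cost per sample does not grow with the number of modalities, which is the scalability argument made in the abstract.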

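Experimental Results

Research Questions

RQ1: Can a single unified model learn cross-modal representations across text, image-like modalities, and neural network features?
RQ2: How do multimodal masking and tokenization affect scalability, transfer to unseen modalities, and generation/editing capabilities?
RQ3: Can the model perform vision tasks out of the box, and how well does it perform after fine-tuning on unseen downstream tasks or modalities?
RQ4: How do the choice of input/target modalities and the masking strategy affect representation learning and downstream transfer?
RQ5: How effective is the model as a steerable multimodal generator conditioned on arbitrary modalities?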

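Key Findings

4M learns rich cross-modal representations without task-specific architectures, enabling many vision tasks.
Pre-training on all input and target modalities transfers strongly to downstream tasks such as detection, segmentation, and depth estimation, outperforming many baselines.
The model supports multimodal conditional generation and in-painting, enabling semantic editing and geometry-grounded generation.
Ablation studies show that multimodal pre-training and the choice of target modalities significantly affect transfer performance, with pre-training on all modalities generally giving the broadest utility.
Scaling analysis shows that performance improves with larger datasets, longer training, and larger model sizes, up to practical limits.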


This review was generated by AI and reviewed by a human editor.