QUICK REVIEW

[Paper Review] Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adrià Recasens | arXiv (Cornell University) | 2020-06-29

Multimodal Machine Learning Applications | References: 92 | Citations: 195

One-Line Summary

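This paper presents the self-supervised MultiModal Versatile (MMV) network, which jointly learns visual, audio, and language representations from unlabelled video, includes a deflation mechanism for applying it to images, and shows strong zero-shot and supervised transfer performance.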

ABSTRACT

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.

Motivation and Goals

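Learn general, versatile multimodal representations from unlabelled video data.
Develop a network that accepts vision, audio, or text as input and makes the modalities directly comparable.
Respect the granularity of each modality: fine-grained similarity between vision and audio, and coarser alignment with text.
Use a deflation mechanism so the same network applies efficiently to both video streams and static images.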

Proposed Method

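Embed each modality into a shared or hierarchical space using a modality-specific backbone and projection heads.
Investigate three modality embedding graphs (Shared, Disjoint, and Fine-and-Coarse, or FAC) for aligning the modalities in joint spaces.
Train with a multimodal contrastive loss that treats pairs from the same video as positives and pairs from different videos as negatives.
Use MIL-NCE for text alignment to handle the misalignment between narration and video content.
Introduce a deflation procedure that converts the video-trained network into an image network without labels.
When a modality is missing, drop the corresponding loss term and re-weight the remaining losses.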
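
The loss described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (nce_loss, mil_nce_loss, mmv_loss), the temperature value, the use of other videos in the batch as negatives, and the rule of renormalising by the remaining weights when a modality is missing are all hypothetical choices made for the example. Following the FAC idea, vision and audio are compared in a fine space, and vision and text in a coarse space.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale embeddings to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def nce_loss(z_x, z_y, temperature=0.07):
    """Contrastive loss for two modalities with embeddings of shape (B, D).

    Row i of z_x and row i of z_y come from the same video (positive);
    all other pairings in the batch act as negatives.
    """
    sim = l2_normalize(z_x) @ l2_normalize(z_y).T / temperature  # (B, B)
    pos = np.diag(sim)
    # Negatives for video i: row i and column i of the similarity matrix.
    # Simplification: the positive score appears twice in the denominator.
    denom = logsumexp(np.concatenate([sim, sim.T], axis=1), axis=1)
    return float(np.mean(denom - pos))

def mil_nce_loss(z_v, z_t, temperature=0.07):
    """MIL-NCE style loss: each video has a bag of K candidate narrations.

    z_v has shape (B, D); z_t has shape (B, K, D). The numerator sums over
    the whole bag, so a loosely aligned narration does not dominate.
    """
    B, K, _ = z_t.shape
    scores = np.einsum("id,jkd->ijk", l2_normalize(z_v), l2_normalize(z_t))
    scores = scores / temperature                                  # (B, B, K)
    pos = logsumexp(scores[np.arange(B), np.arange(B)], axis=1)    # (B,)
    row = scores.reshape(B, B * K)                     # video i vs all text
    col = scores.transpose(1, 0, 2).reshape(B, B * K)  # text of i vs all videos
    denom = logsumexp(np.concatenate([row, col], axis=1), axis=1)
    return float(np.mean(denom - pos))

def mmv_loss(v_fine, a_fine=None, v_coarse=None, t_coarse=None,
             w_va=1.0, w_vt=1.0):
    """Weighted sum of the available loss terms.

    A term is skipped when its modality is missing, and the result is
    renormalised by the weights that remain (an assumed re-weighting rule).
    """
    terms = []
    if a_fine is not None:
        terms.append((w_va, nce_loss(v_fine, a_fine)))
    if v_coarse is not None and t_coarse is not None:
        terms.append((w_vt, mil_nce_loss(v_coarse, t_coarse)))
    if not terms:
        raise ValueError("at least one modality pair is required")
    total_weight = sum(w for w, _ in terms)
    return sum(w * loss for w, loss in terms) / total_weight
```

Calling mmv_loss(v, a_fine=a) on a batch without narration reduces to the vision-audio term alone, which is one simple way to train on data such as audio-visual clips that carry no text.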

Experimental Results

Research Questions

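RQ1: Can a single multimodal network effectively integrate visual, audio, and text information learned from unlabelled video?
RQ2: Which modality embedding graph offers the best balance among cross-modal alignment, within-modality granularity, and the ability to navigate across modalities?
RQ3: Does deflating the video-trained network yield competitive image representations without additional supervision?
RQ4: How does the three-modality model compare with two-modality baselines on standard video, audio, and image benchmarks?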

Key Results

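The FAC (Fine and Coarse) embedding strategy performs strongly on UCF101, HMDB51, MSRVTT, and ESC-50 and outperforms two-modality configurations.
Training with three modalities generally improves the visual representations and supports cross-modal retrieval tasks.
Combining HowTo100M and AudioSet improves results on HMDB51, UCF101, and ESC-50 and makes better use of audio data even when text is absent.
The deflated video-to-image network can be evaluated on image tasks without new annotations and achieves competitive performance.
The method achieves state-of-the-art results among self-supervised methods on several benchmarks and approaches supervised-level performance on large-scale tasks such as Kinetics600.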
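
The idea behind deflation can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the paper's procedure: it uses a single temporal filter (the helper names temporal_conv, inflate_image, and deflate_kernel are hypothetical) and shows that summing the filter over time gives, on one image, the same output as running the temporal filter on a static video made by repeating that image. It also shows that temporal zero padding breaks this equivalence at the clip boundaries, which is one plausible reason a naive deflation would need further adjustment.

```python
import numpy as np

def temporal_conv(video, kernel):
    """Apply a temporal-only filter in 'valid' mode.

    video has shape (T, H, W); kernel has shape (K,).
    Returns an array of shape (T - K + 1, H, W).
    """
    num_frames, k = video.shape[0], kernel.shape[0]
    return np.stack([
        np.tensordot(kernel, video[t:t + k], axes=(0, 0))
        for t in range(num_frames - k + 1)
    ])

def inflate_image(image, num_frames):
    """Build a static video by repeating one image along the time axis."""
    return np.repeat(image[None], num_frames, axis=0)

def deflate_kernel(kernel):
    """Collapse the temporal axis of the filter by summation."""
    return kernel.sum(axis=0)

rng = np.random.default_rng(0)
image = rng.random((4, 4)) + 0.1          # strictly positive toy image
kernel = np.array([0.25, 0.5, 0.25])      # toy temporal filter, K = 3

# Output of the deflated (image) filter on the single image.
out_image = deflate_kernel(kernel) * image

# Output of the original temporal filter on the static video, no padding.
static_video = inflate_image(image, num_frames=8)
out_valid = temporal_conv(static_video, kernel)

# Same filter with one zero frame padded at each end of the clip.
padded = np.pad(static_video, ((1, 1), (0, 0), (0, 0)))
out_padded = temporal_conv(padded, kernel)

matches_without_padding = np.allclose(out_valid[0], out_image)
matches_at_padded_edge = np.allclose(out_padded[0], out_image)
print(matches_without_padding, matches_at_padded_edge)  # prints: True False
```

In the padded case only the boundary frames differ; interior frames such as out_padded[3] still match the image output exactly.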

This review was generated by AI and reviewed by a human editor.