QUICK REVIEW

[Paper Review] Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adrià Recasens|arXiv (Cornell University)|Jun 29, 2020

Multimodal Machine Learning Applications|References: 92|Cited by: 195

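One-Sentence Summary

This paper proposes Self-Supervised MultiModal Versatile (MMV) networks, which learn joint representations of vision, audio and language from unlabelled video, include a deflation mechanism for applying the model to static images, and perform strongly on zero-shot and supervised transfer tasks.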

ABSTRACT

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.

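Research Motivation and Goals

Advance the learning of general, scalable multimodal representations from unlabelled video data.
Develop a network that can ingest vision, audio or text and compare inputs across modalities.
Respect the granularity specific to each modality: fine-grained visual/audio similarity alongside coarser-grained text alignment.
Apply the network efficiently to both video streams and static images through a deflation mechanism.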

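Proposed Method

Use modality-specific backbone networks and projection heads to embed each modality into a shared or hierarchical space.
Study three modality embedding graphs (Shared, Disjoint, and Fine-and-Coarse, FAC) for aligning the modalities in joint spaces.
Train with a multimodal contrastive loss in which positive pairs come from the same video and negative pairs come from different videos.
Use MIL-NCE for text alignment to cope with misalignment between narration and video content.
Introduce a deflation procedure that converts a video-trained network into one that takes images as input, without requiring annotations.
Handle missing modalities by omitting the corresponding loss terms and re-weighting the remaining losses.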
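The method steps above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (nce_loss, mil_nce_loss, mmv_style_loss), the temperature tau=0.07, the random linear projections in W, the feature sizes, the symmetric InfoNCE form of the video-audio loss, the one-directional simplification of MIL-NCE, and the choice to re-weight by renormalising the weights of the remaining terms are all hypothetical. The sketch follows the Fine-and-Coarse idea: video and audio are compared in a fine space, and video is projected further into a coarse space where it is compared with text.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def nce_loss(z_a, z_b, tau=0.07):
    """Symmetric InfoNCE-style loss: row i of z_a and row i of z_b come from
    the same video (positive); all other pairings in the batch are negatives."""
    logits = l2_normalize(z_a) @ l2_normalize(z_b).T / tau   # (N, N)
    pos = np.diag(logits)
    loss_ab = logsumexp(logits, axis=1) - pos
    loss_ba = logsumexp(logits, axis=0) - pos
    return float(np.mean(loss_ab + loss_ba) / 2)

def mil_nce_loss(z_v, z_t, tau=0.07):
    """Simplified MIL-NCE: each clip has K candidate narrations, z_t is (N, K, D).
    All K candidates of the same clip are pooled as positives, so one badly
    aligned narration does not dominate the loss."""
    logits = np.einsum("nd,mkd->nmk", l2_normalize(z_v), l2_normalize(z_t)) / tau
    n = logits.shape[0]
    pos = logsumexp(logits[np.arange(n), np.arange(n), :], axis=1)   # (N,)
    denom = logsumexp(logits.reshape(n, -1), axis=1)                 # (N,)
    return float(np.mean(denom - pos))

def mmv_style_loss(video_feat, W, audio_feat=None, text_feat=None, lam=1.0):
    """Fine-and-Coarse style loss that skips the terms of missing modalities.
    Assumption: the remaining weights are renormalised to sum to 1."""
    v_va = video_feat @ W["v_to_va"]            # video in the fine space
    terms = []                                  # (weight, loss value) pairs
    if audio_feat is not None:
        a_va = audio_feat @ W["a_to_va"]        # audio in the fine space
        terms.append((1.0, nce_loss(v_va, a_va)))
    if text_feat is not None:
        v_vat = v_va @ W["va_to_vat"]           # fine space -> coarse space
        t_vat = text_feat @ W["t_to_vat"]       # text in the coarse space
        terms.append((lam, mil_nce_loss(v_vat, t_vat)))
    if not terms:
        raise ValueError("at least one of audio_feat or text_feat is required")
    total_weight = sum(w for w, _ in terms)
    return sum(w * value for w, value in terms) / total_weight

# Toy batch: 4 clips, 3 candidate narrations each; random stand-ins for backbone outputs.
rng = np.random.default_rng(0)
W = {
    "v_to_va": rng.normal(size=(8, 16)),
    "a_to_va": rng.normal(size=(6, 16)),
    "va_to_vat": rng.normal(size=(16, 12)),
    "t_to_vat": rng.normal(size=(5, 12)),
}
video = rng.normal(size=(4, 8))
audio = rng.normal(size=(4, 6))
text = rng.normal(size=(4, 3, 5))
print("video + audio + text:", mmv_style_loss(video, W, audio, text))
print("video + audio only:  ", mmv_style_loss(video, W, audio_feat=audio))
```

The second call mimics a batch from a dataset without narration: the text term is dropped and the loss reduces to the video-audio term alone.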

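Experimental Results

Research Questions

RQ1 Can a single multimodal network effectively integrate the visual, audio and textual information learned from unlabelled video?
RQ2 Which modality embedding graph offers the best trade-off among cross-modal alignment, within-modality granularity and cross-modal navigability?
RQ3 Does deflating a video-trained network yield competitive image representations without additional supervision?
RQ4 How does the tri-modal model compare with two-modality baselines on standard video, audio and image benchmarks?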

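Key Findings

The FAC (Fine and Coarse) embedding strategy performs well on UCF101, HMDB51, MSRVTT and ESC-50 and outperforms the two-modality configurations.
Training on three modalities generally improves the visual representations and supports cross-modal retrieval tasks.
Combining HowTo100M with AudioSet improves results on HMDB51, UCF101 and ESC-50 and makes better use of audio data when text is unavailable.
The deflated video-to-image network makes evaluation on image tasks possible, with competitive results and no new annotations.
The method achieves state-of-the-art results among self-supervised methods on several benchmarks and approaches supervised performance on large-scale tasks such as Kinetics600.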
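This review does not spell out how deflation works, so the following is only a sketch of one plausible ingredient, stated as an assumption: a space-time (3D) convolution kernel can be collapsed into a 2D kernel by summing it over the temporal axis. On a "static video" made by repeating a single image, the collapsed 2D filter gives the same response as the original 3D filter, as long as temporal padding is ignored. The function names are hypothetical, the convolutions are naive loops for clarity, and any later adjustment of the deflated network is left out.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' cross-correlation of a (T, H, W) clip with a (kt, kh, kw) kernel."""
    kt, kh, kw = kernel.shape
    T, H, W = video.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_valid(image, kernel):
    """Naive 'valid' cross-correlation of an (H, W) image with a (kh, kw) kernel."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def deflate_kernel(kernel3d):
    """Collapse a (kt, kh, kw) kernel to (kh, kw) by summing over time."""
    return kernel3d.sum(axis=0)

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel3d = rng.normal(size=(3, 3, 3))

# A static video: the same image repeated for 4 frames.
static_video = np.repeat(image[None], 4, axis=0)

video_out = conv3d_valid(static_video, kernel3d)            # (2, 4, 4)
image_out = conv2d_valid(image, deflate_kernel(kernel3d))   # (4, 4)

# Every output frame of the 3D filter should match the single 2D output.
print(np.allclose(video_out, image_out[None]))
```

The practical appeal is that the 2D path processes one frame instead of a stack of identical frames, which is where the efficiency gain on still images would come from.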


This review was generated by AI and checked by human editors.