QUICK REVIEW

[Paper Review] Multi-modal Self-Supervision from Generalized Data Transformations

Mandela Patrick, Yuki M. Asano|arXiv (Cornell University)|May 4, 2021

Music and Audio Processing|References: 86|Cited by: 119

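One-Sentence Summary

This paper proposes Generalized Data Transformations (GDTs), a unified framework for systematically exploring invariance and distinctiveness across multiple modalities and temporal dynamics in video. By modeling content-preserving transformations with explicit control over whether the representation is invariant or distinctive to each one, GDTs achieve state-of-the-art accuracy of 72.8% on HMDB-51 and 95.2% on UCF-101, even surpassing supervised pretraining.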

ABSTRACT

In the image domain, excellent representation can be learned by inducing invariance to content-preserving transformations, such as image distortions. In this paper, we show that, for videos, the answer is more complex, and that better results can be obtained by accounting for the interplay between invariance, distinctiveness, multiple modalities and time. We introduce Generalized Data Transformations (GDTs) as a way to capture this interplay. GDTs reduce most previous self-supervised approaches to a choice of data transformations, even when this was not the case in the original formulations. They also allow to choose whether the representation should be invariant or distinctive w.r.t. each effect and tell which combinations are valid, thus allowing us to explore the space of combinations systematically. We show in this manner that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state-of-the-art by a large margin, and even surpassing supervised pretraining. We demonstrate results on a variety of downstream video and audio classification and retrieval tasks, on datasets such as HMDB-51, UCF-101, DCASE2014, ESC-50 and VGG-Sound. In particular, we achieve new state-of-the-art accuracies of 72.8% on HMDB-51 and 95.2% on UCF-101.

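Motivation and Objectives

Address the complexity of learning effective video representations, which goes beyond invariance to image-level distortions.
Formalize the interplay between invariance, distinctiveness, multiple modalities (such as video and audio), and temporal dynamics in self-supervised learning.
Unify multiple self-supervised methods under a single framework of generalized data transformations.
Enable systematic exploration of transformation combinations to identify those that yield the best representations.
Achieve state-of-the-art performance on downstream video and audio classification and retrieval tasks.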

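Proposed Method

Introduce Generalized Data Transformations (GDTs) as a unified formalism covering a range of data augmentations, including spatial, temporal, and modality-specific transformations.
For each transformation, state explicitly whether the learned representation should be invariant or distinctive with respect to it.
Model the transformation space as a set of operations, each of which can be assigned a specific modality (e.g., video or audio) and temporal scope (e.g., frame level or clip level).
Train the model with a contrastive learning objective to be invariant to some transformations (e.g., color jitter) and distinctive to others (e.g., frame-order shuffling), encouraging features that are both robust and discriminative.
Systematically search the space of transformation combinations to identify the configuration that maximizes downstream performance.
Evaluate the learned representations on downstream tasks, on datasets such as HMDB-51, UCF-101, ESC-50, DCASE2014, and VGG-Sound, using linear probing or fine-tuning.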
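
To make the invariant-versus-distinctive idea above more concrete, here is a minimal pure-Python sketch of a contrastive loss in which the choice of distinctive factors decides which pairs of views count as positives. It is an illustration under stated assumptions, not the paper's implementation: the function names (`positive_mask`, `contrastive_loss`), the tag names (`sample`, `reversed`), the temperature value, and the averaging over positives are all hypothetical choices made for this example.

```python
import math

def positive_mask(tags, distinctive):
    """Return an N x N boolean matrix marking positive pairs.

    tags: dict mapping a factor name to a list of N integer labels (one per view).
    distinctive: factor names the representation should tell apart.
    Two different views are positives only if they agree on every distinctive
    factor; factors left out of `distinctive` are treated as invariances.
    """
    n = len(next(iter(tags.values())))
    return [
        [i != j and all(tags[name][i] == tags[name][j] for name in distinctive)
         for j in range(n)]
        for i in range(n)
    ]

def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def contrastive_loss(embeddings, pos_mask, temperature=0.1):
    """InfoNCE-style loss: pull positives together, push all other views apart."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Scaled cosine similarity of anchor i to every other view.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits.values())
        )
        positives = [j for j in logits if pos_mask[i][j]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        total -= sum(logits[j] - log_denominator for j in positives) / len(positives)
        anchors += 1
    if anchors == 0:
        raise ValueError("no anchor has a positive pair")
    return total / anchors

# Toy batch (hypothetical): 2 videos x {forward, reversed} x 2 augmented views.
tags = {
    "sample":   [0, 0, 0, 0, 1, 1, 1, 1],
    "reversed": [0, 0, 1, 1, 0, 0, 1, 1],
}
# One-hot embeddings that separate every (sample, reversed) group.
embeddings = [[1.0 if k == 2 * s + r else 0.0 for k in range(4)]
              for s, r in zip(tags["sample"], tags["reversed"])]

distinctive_to_reversal = positive_mask(tags, ["sample", "reversed"])
invariant_to_reversal = positive_mask(tags, ["sample"])
print(contrastive_loss(embeddings, distinctive_to_reversal))
print(contrastive_loss(embeddings, invariant_to_reversal))
```

Moving a factor such as `reversed` in or out of the `distinctive` list is the only change needed to switch between the two behaviors; the encoder and the loss stay the same.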

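Experimental Results

Research Questions

RQ1: How does the interplay between invariance and distinctiveness, across multiple modalities and time, affect video representation learning?
RQ2: Can a unified framework such as GDTs generalize and unify multiple self-supervised video learning methods?
RQ3: Which combinations of transformations yield the most effective video representations in terms of downstream performance?
RQ4: Can self-supervised learning with GDTs surpass supervised pretraining on video benchmarks?
RQ5: How do modality-specific transformations (e.g., audio perturbations) contribute to multi-modal representation learning?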

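Key Findings

Generalized Data Transformations (GDTs) unify and generalize most previous self-supervised video learning methods within a single, systematic framework.
Learning representations that are invariant to some transformations (e.g., color shifts) and distinctive to others (e.g., changes in frame order) significantly improves performance.
The method achieves a new state-of-the-art accuracy of 72.8% on the HMDB-51 action recognition benchmark.
The method reaches 95.2% accuracy on UCF-101, surpassing the previous state of the art and even outperforming supervised pretraining.
The framework enables systematic exploration of transformation combinations, identifying the configurations that best balance invariance and distinctiveness.
The learned representations generalize well across downstream tasks, including video and audio classification and retrieval on HMDB-51, UCF-101, DCASE2014, ESC-50, and VGG-Sound.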
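
The systematic exploration mentioned in the findings can be pictured as enumerating every invariant/distinctive assignment over a set of transformation factors and scoring each one. The sketch below is a hedged illustration only: the factor names, the assumption that the sample index is always kept distinctive, and the `evaluate` callback (standing in for pretraining plus downstream evaluation) are hypothetical and not taken from the paper's code.

```python
from itertools import product

def enumerate_configs(factors, always_distinctive=("sample",)):
    """List every invariant/distinctive assignment for the free factors.

    Factors named in `always_distinctive` are fixed to "distinctive"
    (assumption: the representation must at least tell samples apart).
    """
    free = [name for name in factors if name not in always_distinctive]
    configs = []
    for choice in product(("invariant", "distinctive"), repeat=len(free)):
        config = {name: "distinctive" for name in always_distinctive}
        config.update(zip(free, choice))
        configs.append(config)
    return configs

def select_best(configs, evaluate):
    """Return the configuration with the highest score from `evaluate`.

    `evaluate` is a caller-supplied function (hypothetical here) that would
    pretrain with the given configuration and return a downstream metric.
    """
    return max(configs, key=evaluate)

# Hypothetical factor names, used only to show the size of the search space.
configs = enumerate_configs(["sample", "time_shift", "time_reversal"])
for config in configs:
    print(config)
```

With two free factors the search space has four configurations; each additional free factor doubles it, which is why an explicit notion of valid combinations helps keep the exploration manageable.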

This review was generated by AI and reviewed by human editors.