QUICK REVIEW

[Paper Review] Labelling unlabelled videos from scratch with multi-modal self-supervision

Yuki M. Asano, Mandela Patrick|arXiv (Cornell University)|Jun 24, 2020

Human Pose and Action Recognition|References 82|Cited by 71

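ONE-SENTENCE SUMMARY

The paper introduces SeLaVi, a multi-modal self-supervised clustering method that exploits audio-visual correspondence to jointly learn representations and pseudo-labels for unlabelled videos, achieving state-of-the-art unsupervised labelling results on several video datasets.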

ABSTRACT

A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain, recent methods have allowed to generate meaningful (pseudo-) labels for unlabelled datasets without supervision, this development is missing for the video domain where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap to ground truth human labels. We further introduce the first benchmarking results on unsupervised labelling of common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.

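MOTIVATION AND OBJECTIVES

Reduce the cost of annotating video data by enabling unsupervised labelling.
Develop a clustering framework that learns from multi-modal (audio-visual) video data without human annotation.
Ensure that clusters reflect semantic content while handling imbalanced (Zipf-like) class distributions.
Achieve modality-robust clustering by treating the audio and visual streams as augmentations of each other and aligning them.
Provide strong baselines on standard video datasets to establish benchmark performance for unsupervised labelling.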

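PROPOSED METHOD

Formulate clustering as an optimal transport problem to prevent degenerate solutions (building on the earlier SeLa work).
Relax the uniform prior over clusters to accommodate skewed real-world distributions, allowing arbitrary priors through Sinkhorn optimization.
Introduce multi-modal single labelling by treating modalities as augmentations and learning modality-agnostic clusters.
Align the modality-specific encoders at initialization so that their outputs agree across modalities.
Learn multiple decorrelated clustering heads to explore diverse, orthogonal labellings in parallel.
Train a pair of encoders (audio and visual) that produce shared cluster assignments, and apply modality-splicing augmentation.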
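
To make the optimal-transport step more concrete, here is a minimal sketch of Sinkhorn-style scaling with a non-uniform cluster prior. It is an illustration only, not the authors' implementation: the function name sinkhorn_assign, the sharpening exponent lam, the iteration count, the toy probabilities and the specific Zipf-like prior are all assumptions introduced for this example.

```python
import math


def sinkhorn_assign(log_probs, prior, lam=1.0, n_iters=100):
    """Soft cluster assignments whose cluster sizes follow `prior`.

    log_probs: N x K nested list of model log-probabilities per sample.
    prior:     K target cluster proportions, positive and summing to 1.
    lam:       sharpening exponent on the probabilities (hypothetical default).
    Returns an N x K matrix Q: rows sum to 1/N, columns sum to ~prior.
    """
    n, k = len(log_probs), len(prior)
    # Kernel P**lam, computed from the log-probabilities.
    q = [[math.exp(lam * lp) for lp in row] for row in log_probs]
    for _ in range(n_iters):
        # Rescale columns so each cluster receives its prior share of mass.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * prior[j] / col[j] for j in range(k)] for i in range(n)]
        # Rescale rows so each sample carries equal mass 1/N.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    return q


# Toy predictions for 6 clips over 3 clusters (made-up numbers).
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]
log_probs = [[math.log(p) for p in row] for row in probs]

# Zipf-like prior: proportions 1, 1/2, 1/3, normalized to sum to 1.
weights = [1 / (j + 1) for j in range(3)]
prior = [w / sum(weights) for w in weights]

q = sinkhorn_assign(log_probs, prior)
# Hard pseudo-labels: the cluster with the largest assigned mass per clip.
labels = [max(range(3), key=lambda j: row[j]) for row in q]
col_mass = [sum(row[j] for row in q) for j in range(3)]
```

The alternating row and column rescaling is what rules out the degenerate solution in which every clip lands in one cluster: the column constraint forces each cluster to hold roughly its prior share of the data. Swapping the Zipf-like prior for a uniform one recovers the equal-sized-cluster setting.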

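EXPERIMENTAL RESULTS

Research questions

RQ1: Can multi-modal self-supervised clustering produce semantically meaningful video labels without human annotation?
RQ2: How does incorporating audio-visual correspondence and modality alignment affect clustering quality compared with single-modality or post-hoc labelling?
RQ3: Do multiple decorrelated clustering heads improve coverage of the space of valid labellings for videos?
RQ4: How robust are the learned clusters to a degraded modality (for example, compressed visual content)?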

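Key findings

SeLaVi achieves state-of-the-art clustering metrics on VGG-Sound, AVE and Kinetics-Sound, significantly outperforming the baselines in NMI, ARI and accuracy.
Using the audio and visual modalities together yields higher clustering quality than either modality alone, with notable gains when the modalities are aligned.
Decorrelated clustering heads and modality alignment substantially improve clustering performance over single-head or simple concatenation baselines.
Without any labelled data, the method perfectly groups 32% of the videos in VGG-Sound and 55% of those in AVE, reaching 57.9% accuracy on AVE.
The unsupervised labels learned by SeLaVi support improved downstream representation learning, including better video action retrieval performance.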
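
For readers unfamiliar with the three clustering metrics named in the findings, the sketch below shows one common way to compute them from ground-truth classes and predicted cluster ids. It is a plain-Python illustration under stated assumptions: the arithmetic-mean normalization for NMI is one of several conventions and may differ from the paper's, and the accuracy uses a brute-force search over cluster-to-class mappings (practical only for a handful of clusters) in place of the Hungarian algorithm usually used for this matching. All function names and the toy labels are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def nmi(y_true, y_pred):
    """Normalized mutual information (arithmetic-mean normalization)."""
    n = len(y_true)
    ct, cp = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in ct.values())
    h_p = -sum(c / n * math.log(c / n) for c in cp.values())
    denom = (h_t + h_p) / 2
    return mi / denom if denom > 0 else 1.0


def _pairs(x):
    """Number of unordered pairs among x items."""
    return x * (x - 1) / 2


def ari(y_true, y_pred):
    """Adjusted Rand index computed from the contingency counts."""
    n = len(y_true)
    sum_joint = sum(_pairs(c) for c in Counter(zip(y_true, y_pred)).values())
    sum_t = sum(_pairs(c) for c in Counter(y_true).values())
    sum_p = sum(_pairs(c) for c in Counter(y_pred).values())
    expected = sum_t * sum_p / _pairs(n)
    max_index = (sum_t + sum_p) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one cluster-to-class mappings (brute force)."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with None so surplus clusters can stay unmatched.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    joint = Counter(zip(y_pred, y_true))
    best = max(sum(joint[(c, t)] for c, t in zip(clusters, perm))
               for perm in permutations(padded, len(clusters)))
    return best / len(y_true)


# Toy labels: cluster ids are arbitrary, so a pure relabelling scores perfectly.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_perfect = [1, 1, 1, 2, 2, 2, 0, 0, 0]
y_one_error = [1, 1, 2, 2, 2, 2, 0, 0, 0]

perfect_scores = (nmi(y_true, y_perfect), ari(y_true, y_perfect),
                  clustering_accuracy(y_true, y_perfect))
noisy_scores = (nmi(y_true, y_one_error), ari(y_true, y_one_error),
                clustering_accuracy(y_true, y_one_error))
```

All three metrics are invariant to how cluster ids are numbered, which is why they suit unsupervised labelling: a clustering that recovers the classes under a different numbering still scores 1.0.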

This review was generated by AI and reviewed by human editors.