QUICK REVIEW

[Paper Review] Self-supervised Video Representation Learning by Pace Prediction

Jiangliu Wang, Jianbo Jiao|arXiv (Cornell University)|Aug 13, 2020

Human Pose and Action Recognition | References: 2 | Cited by: 61

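One-Sentence Summary

Proposes pace prediction as a self-supervised pretext task for learning video representations without using a motion channel, strengthened with contrastive learning; achieves state-of-the-art results on action recognition and video retrieval across multiple backbone networks.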

ABSTRACT

This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction. It stems from the observation that human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in film making. Specifically, given a video played in natural pace, we randomly sample training clips in different paces and ask a neural network to identify the pace for each video clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we further introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace.

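Motivation and Objectives

Advance self-supervised video representation learning, motivated by the human visual system's sensitivity to video pace.
Introduce a pace prediction pretext task that learns spatio-temporal features from clips randomly sampled at different paces.
Strengthen the pace task with contrastive learning, which acts as a regularizer and improves discriminative power.
Evaluate on action recognition and video retrieval with multiple backbones (C3D, 3D-ResNet, R(2+1)D, S3D-G).
Demonstrate the method's effectiveness and its potential to scale with unlabeled video data.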

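Proposed Method

Sample clips at multiple paces from unlabeled videos to create the pace prediction pretext task.
Train a 3D CNN backbone with a cross-entropy loss to classify the pace applied to each input clip.
Introduce contrastive learning that maximizes agreement between positive pairs (same pace or same context) and separates negative pairs.
Study two contrastive configurations, same context (content-aware) and same pace (content-agnostic), and their effect on performance.
Combine the pace prediction loss and the contrastive loss into a weighted-sum objective.
Evaluate on multiple backbones (C3D, 3D-ResNet, R(2+1)D, S3D-G) and on downstream tasks such as action recognition and video retrieval.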
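
The weighted-sum objective can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the InfoNCE-style form of the contrastive term, the cosine similarity, the temperature, and the weights `w_cls` and `w_ctr` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one sample: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def cosine(u, v):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form, not from the source).

    The positive is a clip with the same context (or the same pace) as the
    anchor; the negatives are the remaining clips.
    """
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    return cross_entropy(logits, 0)  # index 0 holds the positive pair

def total_loss(pace_logits, pace_label, anchor, positive, negatives,
               w_cls=1.0, w_ctr=0.1):
    """Weighted sum of the pace classification and contrastive losses.

    w_cls and w_ctr are hypothetical weights, not values from the paper.
    """
    return (w_cls * cross_entropy(pace_logits, pace_label)
            + w_ctr * info_nce(anchor, positive, negatives))
```

Setting `w_ctr=0` reduces the objective to plain pace classification, which mirrors the comparison between pace prediction alone and pace prediction with contrastive learning.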

Figure 1: Illustration of the proposed pace prediction task. Given a video sample, frames are randomly selected by different paces to formulate the training inputs. Here, three different clips, Clip I, II, III, are sampled by normal, slow and fast pace randomly. Can you ascribe the corresponding pa

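Experimental Results

Research Questions

RQ1: Can a pace-based pretext task learn strong spatio-temporal video representations without using a motion channel?
RQ2: Does adding contrastive learning further improve the representations learned through pace prediction?
RQ3: How do different backbone architectures respond to pace-based self-supervision?
RQ4: How do the same-context and same-pace contrastive strategies affect downstream performance?
RQ5: When pre-trained on unlabeled data, how does the proposed method perform on standard video understanding benchmarks (action recognition and retrieval)?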

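Key Findings

Pace prediction alone yields substantial gains over random initialization across multiple backbones.
Adding contrastive learning improves performance further; same-context contrast outperforms same-pace contrast in most settings.
Among the evaluated configurations, the R(2+1)D backbone with pace prediction achieves the best results on UCF101 and HMDB51.
Combining pace prediction with context-based contrastive learning achieves state-of-the-art or competitive results compared with contemporary self-supervised methods.
Attention visualizations show that under pace-based supervision the model focuses on regions of motion, which supports the claim that it learns spatio-temporal reasoning.
The method performs well on both action recognition and video retrieval while using only the video modality.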

Figure 2: Generating training samples and pace labels from the proposed pretext task. Here, we show five different sampling paces, named as super slow, slow, normal, fast, and super fast. The darker the initial frame is, the faster the entire clip plays.
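
As a rough illustration of how training samples and pace labels might be generated, the sketch below picks a random pace and returns the frame indices for one clip. It makes several assumptions: pace is modelled only as an integer frame stride, so paces slower than normal are not covered; the `PACES` list and the clip length of 16 are placeholder values; and the function names are hypothetical rather than taken from the released code.

```python
import random

# Hypothetical pace candidates: the class label is the list index and the
# value is the frame stride. Only frame skipping is modelled here; paces
# slower than normal would need a different mechanism.
PACES = [1, 2, 3, 4]

def sample_clip_indices(num_frames, clip_len, pace, rng=random):
    """Return clip_len frame indices spaced `pace` frames apart."""
    span = (clip_len - 1) * pace + 1  # frames covered by the clip
    if span > num_frames:
        raise ValueError("video too short for this clip length and pace")
    start = rng.randrange(num_frames - span + 1)  # random temporal offset
    return [start + i * pace for i in range(clip_len)]

def make_training_sample(num_frames, clip_len=16, rng=random):
    """Pick a random pace and return (frame indices, pace label)."""
    label = rng.randrange(len(PACES))
    indices = sample_clip_indices(num_frames, clip_len, PACES[label], rng)
    return indices, label
```

The label comes from the sampling procedure itself, so no human annotation is required; the returned indices would then be used to gather frames from the decoded video before the clip is passed to the backbone.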


This review was generated by AI and reviewed by a human editor.