QUICK REVIEW

[Paper Review] Self-Supervised Spatio-Temporal Representation Learning Using Variable Playback Speed Prediction

Hyeon Cho, Taehoon Kim|arXiv (Cornell University)|Mar 5, 2020

Human Pose and Action Recognition | Cited by 30

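One-Sentence Summary

The paper proposes a self-supervised method for spatio-temporal representation learning that learns temporal dynamics by predicting the playback speeds of video clips, with no manual annotation. It trains a 3D convolutional neural network (3D CNN) to sort shuffled clips by playback speed, covering both forward and reverse directions, and introduces a layer-dependent temporal group normalization method; after fine-tuning, it outperforms prior state-of-the-art self-supervised methods on action recognition.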

ABSTRACT

We propose a self-supervised learning method by predicting the variable playback speeds of a video. Without semantic labels, we learn the spatio-temporal representation of the video by leveraging the variations in the visual appearance according to different playback speeds under the assumption of temporal coherence. To learn the spatio-temporal variations in the entire video, we have not only predicted a single playback speed but also generated clips of various playback speeds with randomized starting points. We then train a 3D convolutional network by solving the formulation that sorts the shuffled clips by their playback speed. In this case, the playback speed includes both forward and reverse directions; hence the visual representation can be successfully learned from the directional dynamics of the video. We also propose a novel layer-dependable temporal group normalization method that can be applied to 3D convolutional networks to improve the representation learning performance where we divide the temporal features into several groups and normalize each one using the different corresponding parameters. We validate the effectiveness of the proposed method by fine-tuning it to the action recognition task. The experimental results show that the proposed method outperforms state-of-the-art self-supervised learning methods in action recognition.
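The sample construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the speed set `SPEEDS`, the clip length, and the helper names `sample_clip_indices` and `make_sorting_sample` are all assumptions made for the example. Casting the sorting task as classification over permutations is one common formulation and is assumed here as well.

```python
import itertools
import random

# Hypothetical speed set: the sign encodes direction, the magnitude is the
# frame stride. These values are illustrative, not taken from the paper.
SPEEDS = (-2, -1, 1, 2)
# Every ordering of the clips is one class (4 clips give 24 classes).
PERMUTATIONS = list(itertools.permutations(range(len(SPEEDS))))


def sample_clip_indices(num_frames, clip_len, speed, rng):
    """Return frame indices for one clip played at `speed` from a random start."""
    stride = abs(speed)
    span = (clip_len - 1) * stride + 1  # number of source frames the clip covers
    if span > num_frames:
        raise ValueError("video too short for this speed and clip length")
    start = rng.randrange(num_frames - span + 1)
    indices = [start + i * stride for i in range(clip_len)]
    # Negative speed means reverse playback: same frames, opposite order.
    return indices[::-1] if speed < 0 else indices


def make_sorting_sample(num_frames, clip_len, rng):
    """Build one pretext sample: shuffled clips plus the permutation class."""
    clips = [sample_clip_indices(num_frames, clip_len, s, rng) for s in SPEEDS]
    label = rng.randrange(len(PERMUTATIONS))
    order = PERMUTATIONS[label]
    # Position j of the shuffled input holds the clip with speed SPEEDS[order[j]].
    shuffled = [clips[i] for i in order]
    return shuffled, label


rng = random.Random(0)
shuffled_clips, label = make_sorting_sample(num_frames=64, clip_len=8, rng=rng)
```

In a real pipeline the index lists would select frames from a decoded video, and a 3D CNN would receive the shuffled clips and predict `label`. Because each clip draws its own random start, the clips cover different parts of the video, which matches the abstract's "randomized starting points".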

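Research Motivation and Goals

Learn robust spatio-temporal representations without manually annotated labels.
Use temporal coherence and the change in visual appearance across playback speeds as the supervisory signal.
Improve what a 3D CNN learns by modeling both forward and reverse playback dynamics.
Improve feature normalization in 3D convolutional networks with a layer-dependent temporal group normalization method.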

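Proposed Method

The method generates video clips with randomized starting points and several playback speeds, including reverse playback, to create diverse training samples.
A 3D convolutional neural network is trained to sort the shuffled clips by playback speed; this sorting task is the pretext objective.
The method assumes temporal coherence: under that assumption, the way visual appearance varies with playback speed provides the supervisory signal for representation learning.
A new layer-dependent temporal group normalization method divides the temporal features into several groups and normalizes each group with its own parameters.
The model is pre-trained in a self-supervised manner and then fine-tuned on the downstream action recognition task.
Sorting clips by playback speed is formulated as a multi-class classification problem, so the network learns to distinguish temporal dynamics.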
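The temporal group normalization step in the list above can be sketched in plain Python. This is a hedged illustration under stated assumptions, not the paper's layer: features are simplified to a `[T][D]` list (one flattened vector per time step), statistics are computed per temporal group within a single sample, and "layer-dependent" is interpreted as a different number of groups per layer. The function name `temporal_group_norm` and the table `GROUPS_PER_LAYER` are hypothetical.

```python
import math


def temporal_group_norm(features, num_groups, gammas, betas, eps=1e-5):
    """Normalize a [T][D] feature list separately for each temporal group.

    features   : list of T time steps, each a list of D floats
    num_groups : number of temporal groups; T must be divisible by it
    gammas     : one scale per group (learned in a real network)
    betas      : one shift per group (learned in a real network)
    """
    num_steps = len(features)
    if num_steps % num_groups != 0:
        raise ValueError("T must be divisible by num_groups")
    if len(gammas) != num_groups or len(betas) != num_groups:
        raise ValueError("need one gamma and one beta per group")
    steps_per_group = num_steps // num_groups
    normalized = []
    for g in range(num_groups):
        group = features[g * steps_per_group:(g + 1) * steps_per_group]
        values = [v for step in group for v in step]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        # Each group uses its own affine parameters after normalization.
        for step in group:
            normalized.append(
                [gammas[g] * (v - mean) * inv_std + betas[g] for v in step]
            )
    return normalized


# Hypothetical per-layer group counts: deeper layers usually have a shorter
# temporal axis, so they get fewer groups. These numbers are illustrative only.
GROUPS_PER_LAYER = {"conv1": 4, "conv2": 2, "conv3": 1}

features = [[float(t + d) for d in range(3)] for t in range(8)]  # T=8, D=3
out = temporal_group_norm(features, GROUPS_PER_LAYER["conv1"],
                          gammas=[1.0] * 4, betas=[0.0] * 4)
```

A production version would operate on 5D tensors (batch, channels, time, height, width) inside a deep learning framework and would learn the per-group scale and shift. The sketch only shows the grouping along the temporal axis and the use of separate parameters per group.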

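Experimental Results

Research Questions

RQ1: Can variable playback speed prediction serve as an effective self-supervised signal for learning spatio-temporal video representations?
RQ2: How does modeling both forward and reverse playback directions improve the learning of temporal dynamics?
RQ3: To what extent does the layer-dependent temporal group normalization method improve representation learning in 3D CNNs?
RQ4: Does the method outperform existing state-of-the-art self-supervised methods on action recognition benchmarks?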

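Key Findings

The proposed method outperforms existing state-of-the-art self-supervised learning methods on action recognition benchmarks.
Including reverse playback speeds improves the model's ability to capture directional dynamics in video sequences.
The layer-dependent temporal group normalization method improves the quality of the learned features and, in turn, model performance.
After fine-tuning, the model performs well on standard action recognition datasets, indicating that the learned representation transfers to the downstream task.
The self-supervised pre-training strategy learns spatio-temporal features without any manually annotated labels.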

This review was generated by AI and reviewed by a human editor.