QUICK REVIEW

[Paper Review] Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations

Longlong Jing, Yingli Tian|arXiv (Cornell University)|Nov 28, 2018

Human Pose and Action Recognition | References: 35 | Cited by: 76

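One-Sentence Summary

This paper proposes a 3DConvNet-based self-supervised framework that learns spatiotemporal video features by using the recognition of geometric transformations (rotations of 0°, 90°, 180°, and 270°) as a pretext task, removing the need for human-labeled data. Compared with a model trained from scratch, the pretrained features improve action recognition accuracy by 20.4% on UCF101 and 16.7% on HMDB51, reaching 62.9% and 33.7% top-1 accuracy respectively and outperforming prior fully self-supervised methods on both datasets.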

ABSTRACT

To alleviate the expensive cost of data collection and annotation, many self-supervised learning methods were proposed to learn image representations without human-labeled annotations. However, self-supervised learning for video representations is not yet well-addressed. In this paper, we propose a novel 3DConvNet-based fully self-supervised framework to learn spatiotemporal video features without using any human-labeled annotations. First, a set of pre-designed geometric transformations (e.g. rotating 0 degree, 90 degrees, 180 degrees, and 270 degrees) are applied to each video. Then a pretext task can be defined as recognizing the pre-designed geometric transformations. Therefore, the spatiotemporal video features can be learned in the process of accomplishing this pretext task without using human-labeled annotations. The learned spatiotemporal video representations can further be employed as pretrained features for different video-related applications. The proposed geometric transformations (e.g. rotations) are proved to be effective to learn representative spatiotemporal features in our 3DConvNet-based fully self-supervised framework. With the pre-trained spatiotemporal features from two large video datasets, the performance of action recognition is significantly boosted up by 20.4% on UCF101 dataset and 16.7% on HMDB51 dataset respectively compared to that from the model trained from scratch. Furthermore, our framework outperforms the state-of-the-arts of fully self-supervised methods on both UCF101 and HMDB51 datasets and achieves 62.9% and 33.7% accuracy respectively.
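The abstract defines the pretext task as recognizing which pre-designed transformation was applied. One common way to write such an objective is a cross-entropy loss over the K = 4 rotation classes. The notation below (V for a video clip, g_y for the rotation with label y, p_θ for the network's predicted class probability) is our own assumption for illustration and is not taken from the paper.

```latex
\mathcal{L}(V) = -\frac{1}{K} \sum_{y=0}^{K-1} \log p_\theta\bigl(y \mid g_y(V)\bigr),
\qquad K = 4, \quad g_y = \text{rotation by } 90^\circ \cdot y
```

Under this reading, each clip contributes one term per rotation, and the only supervision is the index of the rotation that was applied.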

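Research Motivation and Objectives

Address the high cost of annotating video data by learning spatiotemporal video features in a self-supervised way.
Develop a fully self-supervised framework that removes the dependence of video representation learning on human-labeled data.
Improve action recognition performance with features pretrained on a geometric-transformation pretext task.
Demonstrate that geometric transformations are an effective supervisory signal for learning meaningful spatiotemporal features.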

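Proposed Method

A set of pre-designed geometric transformations is applied to each input video clip: rotations of 0°, 90°, 180°, and 270°.
A 3DConvNet is trained to predict which transformation was applied; this prediction is the pretext task, and spatiotemporal features are learned while solving it.
The framework is trained end to end without any human-annotated labels, relying only on the transformation prediction task.
The learned features are then fine-tuned on downstream video classification tasks such as action recognition.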
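To recognize the applied transformation, the network has to capture the spatial and temporal structure of the video, which is what leads it to learn robust video representations.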
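The features are pretrained on two large video datasets and evaluated on UCF101 and HMDB51 to measure generalization and performance.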
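The data-generation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a clip is represented as a nested list of frames (each frame a list of pixel rows), and the helper names `rotate_frame_90`, `rotate_clip`, and `make_pretext_batch` are hypothetical. A real pipeline would more likely use an array library and feed the result to a 3DConvNet.

```python
# Rotation classes used as pretext labels: label k means a rotation of k * 90 degrees.
ROTATIONS = (0, 90, 180, 270)


def rotate_frame_90(frame):
    """Rotate one frame (a list of pixel rows) by 90 degrees counterclockwise."""
    # Transposing the frame and reversing the row order gives a counterclockwise turn.
    return [list(row) for row in zip(*frame)][::-1]


def rotate_clip(clip, k):
    """Apply the same k * 90 degree rotation to every frame of a clip."""
    rotated = clip
    for _ in range(k % 4):
        rotated = [rotate_frame_90(frame) for frame in rotated]
    return rotated


def make_pretext_batch(clip):
    """Return the four rotated copies of a clip, each paired with its class label."""
    return [(rotate_clip(clip, k), k) for k in range(len(ROTATIONS))]


if __name__ == "__main__":
    # A toy clip (hypothetical data): 2 frames of 2x3 grayscale pixels.
    clip = [
        [[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [10, 11, 12]],
    ]
    for rotated, label in make_pretext_batch(clip):
        print(ROTATIONS[label], rotated[0])
```

The labels come for free from the transformation itself, which is why no human annotation is needed. Every frame in a clip receives the same rotation, so the clip stays temporally consistent.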

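Experimental Results

Research Questions

RQ1: Can geometric transformations serve as an effective supervisory signal for self-supervised video representation learning?
RQ2: How well can a 3DConvNet learn spatiotemporal features through the rotation-prediction pretext task without any human annotations?
RQ3: How much does pretraining with this method improve downstream action recognition compared with training from scratch?
RQ4: How does the framework compare with state-of-the-art fully self-supervised video learning methods on standard benchmarks?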

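Key Findings

The proposed method reaches 62.9% top-1 accuracy on UCF101, outperforming state-of-the-art fully self-supervised methods.
On HMDB51, the method reaches 33.7% top-1 accuracy, a new state of the art for fully self-supervised video learning.
Compared with training from scratch, pretraining on the geometric-transformation pretext task improves action recognition accuracy on UCF101 by 20.4%.
Compared with a model without pretraining, the method improves action recognition accuracy on HMDB51 by 16.7%.
In the absence of human annotations, geometric transformations such as rotations are effective for learning representative spatiotemporal features.
The gains hold on both UCF101 and HMDB51, indicating that the self-supervised signal transfers across datasets.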
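The reported gains and final accuracies together imply the accuracy of the from-scratch baseline. The short calculation below assumes the improvements are absolute differences in percentage points, which this summary does not state explicitly. The dictionary layout and the function name `implied_baseline` are illustrative.

```python
# Final accuracies (%) and gains (%) as quoted in the findings above.
reported = {
    "UCF101": {"pretrained": 62.9, "gain": 20.4},
    "HMDB51": {"pretrained": 33.7, "gain": 16.7},
}


def implied_baseline(pretrained, gain):
    """Baseline accuracy, assuming the gain is an absolute difference in points."""
    return round(pretrained - gain, 1)


baselines = {name: implied_baseline(**vals) for name, vals in reported.items()}
print(baselines)  # {'UCF101': 42.5, 'HMDB51': 17.0}
```

If the gains were instead relative improvements, the implied baselines would differ, so these values should be read as a consistency check and not as reported results.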

This review was generated by AI and reviewed by human editors.