Skip to main content
QUICK REVIEW

[论文解读] Self-supervised Co-training for Video Representation Learning

Tengda Han, Weidi Xie|arXiv (Cornell University)|Oct 19, 2020
Human Pose and Action Recognition参考文献 72被引用 273
一句话总结

本文提出 CoCLR,一种自监督协同训练框架,在 RGB 与光流视图之间交换正样本以提升对比学习,在视频动作识别与检索中以更高的训练效率接近 UberNCE 的性能。

ABSTRACT

The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.

研究动机与目标

  • 探究仅以实例辨别是否最充分利用视频数据进行自监督学习。
  • 评估来自语义类别的困难正样本是否能提升对比视频表征。
  • 提出一种自监督协同训练方案(CoCLR),在互补视图(RGB 与 flow)之间挖掘正样本。
  • 在下游任务上评估所学习的表征:在 UCF101、HMDB51 和 Kinetics-400 上的动作识别与视频检索。

提出的方法

  • 将 InfoNCE 基线(实例辨识)与使用语义标签的 oracle UberNCE 进行比较。
  • 引入 CoCLR 以挖掘跨视图正样本:利用 flow 视图中前 K 相似片段来增强 RGB 训练,反之亦然。
  • 在 RGB 与 flow 网络之间交替优化,以逐步改进表征。
  • 使用两阶段训练:(i) RGB 与 flow 的独立 InfoNCE 预训练,(ii) 使用跨视图正样本的交替协同训练。
  • 通过线性探针和检索评估来衡量所学表征的迁移性。
Figure 1: Two video clips of a golf-swing action and their corresponding optical flows. In this example, the flow patterns are very similar across different video instances despite significant variations in RGB space. This observation motivates the idea of co-training, which aims to gradually enhanc
Figure 1: Two video clips of a golf-swing action and their corresponding optical flows. In this example, the flow patterns are very similar across different video instances despite significant variations in RGB space. This observation motivates the idea of co-training, which aims to gradually enhanc

实验结果

研究问题

  • RQ1在视频表征学习中,结合语义类正样本(UberNCE)是否比仅使用实例的 InfoNCE 更优?
  • RQ2在 RGB 与光流视图之间进行协同训练是否能挖掘更困难的正样本并提升下游性能?
  • RQ3CoCLR 相较于单视图自监督方法以及 UberNCE 在动作识别和检索中的表现如何?
  • RQ4诸如顶级-K 正样本挖掘(K)和交替循环等超参数对 CoCLR 性能有何影响?

主要发现

  • UberNCE 的表现优于 InfoNCE,说明实例判别可能浪费数据资源。
  • CoCLR 相较于 InfoNCE 和 CMC 有显著提升,在线性探针动作识别(RGB)和检索方面接近 UberNCE 的性能。
  • 双流 CoCLR(RGB+Flow)进一步提升结果,RGB 与 Flow 模型提供互补收益。
  • 端到端微调缩小了各训练方案之间的性能差距,但在预训练迁移场景中 CoCLR 仍然优越。
  • CoCLR 在 UCF101 和 Kinetics-400 上展现出与其它自监督方法相媲美或处于前沿的结果,且训练更高效、所需数据更少。
Figure 3: Nearest neighbour retrieval results with CoCLR representations. The left side is the query video from the UCF101 testing set, and the right side are the top 3 nearest neighbours from the UCF101 training set. CoCLR is trained only on UCF101. The action label for each video is shown in the u
Figure 3: Nearest neighbour retrieval results with CoCLR representations. The left side is the query video from the UCF101 testing set, and the right side are the top 3 nearest neighbours from the UCF101 training set. CoCLR is trained only on UCF101. The action label for each video is shown in the u

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。