Skip to main content
QUICK REVIEW

[论文解读] Video Representation Learning with Visual Tempo Consistency

Ceyuan Yang, Yinghao Xu|arXiv (Cornell University)|Jun 28, 2020
Human Pose and Action Recognition参考文献 63被引用 61
一句话总结

论文通过在慢速和快速节奏下对同一动作的视频施加相似性约束,提出视觉节奏一致性与分层对比学习(VTHCL),以自监督方式学习视频表征,在动作识别方面实现具竞争力的表现并对其他任务有良好迁移,同时提供一个实例对应图用于解释。

ABSTRACT

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates respectively, we can obtain slow and fast video frames which share the same semantics but contain different visual tempos. Video representations learned from VTHCL achieve the competitive performances under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1\%) and HMDB-51 (49.2\%). Moreover, comprehensive experiments suggest that the learned representations are generalized well to other downstream tasks including action detection on AVA and action anticipation on Epic-Kitchen. Finally, we propose Instance Correspondence Map (ICM) to visualize the shared semantics captured by contrastive learning.

研究动机与目标

  • 证明视觉节奏可以作为视频表征学习的自监督信号。
  • 提出一个分层对比学习框架,在多网络深度利用节奏诱导的语义。
  • 展示在 UCF-101 和 HMDB-51 上具有竞争力的动作识别结果,并具备对 AVA 检测与 Epic-Kitchen 预期任务的迁移能力。
  • 通过实例对应映射(ICM)解释学习到的表征,以揭示共享的实例语义。

提出的方法

  • 使用慢速和快速视频编码器,在同一动作实例上以不同节奏提取表征。
  • 应用对比学习,在同一实例的快慢表征之间最大化互信息,同时将其他实例作为负样本。
  • 通过引入多深度特征(如 res3、res4、res5)以及每个深度对应的记忆库,扩展到分层对比学习。
  • 采用带动量更新的记忆库机制,以稳定表征学习。
  • 通过非线性映射 phi 和温度 T 定义相似性函数 h,以计算跨时间的相似性。
  • 引入实例对应映射(ICM),以可视化对比目标捕捉的共享语义。
Figure 1: Visual Tempo Consistency enforces the network to learn the high representational similarity between the same instance ( e.g. $V_{i}$ ) sampled at different tempos ( e.g. $V_{i}^{s}$ and $V_{i}^{f}$ ). Meanwhile, it also follows the same mechanism as previous instance discrimination task [
Figure 1: Visual Tempo Consistency enforces the network to learn the high representational similarity between the same instance ( e.g. $V_{i}$ ) sampled at different tempos ( e.g. $V_{i}^{s}$ and $V_{i}^{f}$ ). Meanwhile, it also follows the same mechanism as previous instance discrimination task [

实验结果

研究问题

  • RQ1相同动作的两段剪辑之间的视觉节奏变动是否比单一节奏方法提供更强的自监督信号?
  • RQ2利用时间深度的分层对比学习能否提升视频表征学习?
  • RQ3节奏一致的表征对下游任务(如检测、预期)等的迁移能力如何?
  • RQ4我们是否能够通过 ICM 对模型学习的共享实例语义进行定性解释?

主要发现

方法骨干帧数UCF-101 (Top-1)HMDB-51 (Top-1)
VTHCL-R18 (Ours)3D-ResNet18880.648.6
VTHCL-R50 (Ours)3D-ResNet50882.149.2
  • 使用慢速和快速节奏对的 VTHCL 在 UCF-101(R50 Top-1 82.1%)和 HMDB-51(R50 Top-1 49.2%)上实现具有竞争力的动作识别。
  • 使用多网络深度(如 res3/res4/res5)的分层对比学习相较单深度对比损失提升性能。
  • 增大慢速与快速剪辑之间的节奏差(α)通常能提升准确率,相较于基线单一节奏对比学习。
  • VTHCL 表征可迁移到其他任务,包括 AVA 动作检测和 Epic-Kitchen 动作预期,展示了超越识别的泛化能力。
  • ICM 提供定性可视化,显示所学表征定位出辨别性区域和移动对象,揭示无标签的共享实例语义。
Figure 2: Framework. (a) The same instance with different tempos ( e.g. $V_{i}^{f}$ and $V_{i}^{s}$ ) should share high similarity in terms of their discriminative semantics while are dissimilar to other instances (grey dots). (b) The features at various depths of networks allow to construct the hie
Figure 2: Framework. (a) The same instance with different tempos ( e.g. $V_{i}^{f}$ and $V_{i}^{s}$ ) should share high similarity in terms of their discriminative semantics while are dissimilar to other instances (grey dots). (b) The features at various depths of networks allow to construct the hie

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。