QUICK REVIEW

[Paper Review] Can Temporal Information Help with Contrastive Self-Supervised Learning?

Yutong Bai, Haoqi Fan|arXiv (Cornell University)|Nov 25, 2020

Human Pose and Action Recognition|References: 33|Cited by: 29

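One-Sentence Summary

The paper proposes TaCo, a temporal-aware contrastive self-supervised learning framework that improves video representation learning by using temporal transformations both as data augmentation and as self-supervised signals. By adding task-specific heads for several video-level pretext tasks (such as action reversal and speed change), TaCo achieves state-of-the-art performance: 85.1% top-1 accuracy on UCF-101 and 51.6% on HMDB-51, relative improvements of 3% and 2.4% over previous methods, respectively.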

ABSTRACT

Leveraging temporal information has been regarded as essential for developing video understanding models. However, how to properly incorporate temporal information into the recent successful instance discrimination based contrastive self-supervised learning (CSL) framework remains unclear. As an intuitive solution, we find that directly applying temporal augmentations does not help, or even impairs, video CSL in general. This counter-intuitive observation motivates us to re-design existing video CSL frameworks for better integration of temporal knowledge. To this end, we present Temporal-aware Contrastive self-supervised learning (TaCo) as a general paradigm to enhance video CSL. Specifically, TaCo selects a set of temporal transformations not only as strong data augmentation but also to constitute extra self-supervision for video understanding. By jointly contrasting instances with enriched temporal transformations and learning these transformations as self-supervised signals, TaCo can significantly enhance unsupervised video representation learning. For instance, TaCo demonstrates consistent improvement in downstream classification tasks over a list of backbones and CSL approaches. Our best model achieves 85.1% (UCF-101) and 51.6% (HMDB-51) top-1 accuracy, which is a 3% and 2.4% relative improvement over the previous state-of-the-art.

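Research Motivation and Objectives

Investigate whether temporal information can improve contrastive self-supervised learning (CSL) for video representation learning.
Identify why directly applying temporal augmentations in existing CSL frameworks usually fails to help or degrades performance.
Design a new framework that integrates temporal knowledge into CSL by using temporal transformations both as augmentation and as self-supervised signals.
Explore the intrinsic relationships among different video pretext tasks and how their combinations affect learning efficiency.
Establish a generalizable, flexible paradigm for unsupervised video representation learning that outperforms existing methods.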

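Proposed Method

TaCo gives temporal transformations a dual role: strong data augmentation and self-supervised signal for video understanding.
It extends the standard contrastive learning setup with additional task heads, each dedicated to one temporal pretext task such as action reversal, clip shuffling, or speed change.
The framework jointly optimizes the contrastive loss between differently augmented views and a task-specific loss for each temporal transformation, so that a single representation is shared across tasks.
A balancing hyperparameter λ trades off the contrastive loss against the task-specific losses so that the latter do not dominate training.
The method is compatible with multiple backbones (e.g., ResNet-18, R(2+1)D-18, ResNet-50) and contrastive learning frameworks (e.g., MoCo, InstDisc).
The framework is evaluated on standard benchmarks, including UCF-101 and HMDB-51, under both linear evaluation and fine-tuning protocols.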
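
The dual role of temporal transformations described above can be illustrated with a minimal sketch. It is not the authors' implementation: a clip is modeled as a plain list of frame indices, and the function names, the label ids in `TASKS`, the untransformed "identity" option, and the segment count and speed step are all hypothetical choices made for illustration. Each call returns a transformed clip (the augmentation) together with the id of the transformation applied (the self-supervised label a task head would be trained to predict).

```python
import random

# Hypothetical label ids; the review names the tasks but not their encoding.
TASKS = {"identity": 0, "reverse": 1, "shuffle": 2, "speed": 3}


def reverse(frames):
    """Play the clip backwards."""
    return frames[::-1]


def shuffle(frames, rng, segments=4):
    """Split the clip into contiguous segments and permute their order.

    Note: a random permutation can occasionally leave the order unchanged.
    """
    size = max(1, len(frames) // segments)
    chunks = [frames[i:i + size] for i in range(0, len(frames), size)]
    rng.shuffle(chunks)
    return [frame for chunk in chunks for frame in chunk]


def speed(frames, step=2):
    """Keep every `step`-th frame to simulate faster playback."""
    return frames[::step]


def temporal_view(frames, rng):
    """Return (transformed clip, task label) for one randomly chosen task.

    The clip serves as an augmented view for the contrastive branch;
    the label is the target for the matching task-specific head.
    """
    name = rng.choice(sorted(TASKS))
    if name == "reverse":
        out = reverse(frames)
    elif name == "shuffle":
        out = shuffle(frames, rng)
    elif name == "speed":
        out = speed(frames)
    else:
        out = list(frames)  # identity: clip left untouched
    return out, TASKS[name]


rng = random.Random(0)
clip = list(range(16))  # stand-in for 16 video frames
view, label = temporal_view(clip, rng)
```

In a real pipeline the frame indices would be used to gather decoded frames into a tensor; the sketch only shows how one transformation yields both a view and a label.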

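Experimental Results

Research Questions

RQ1: Can temporal information improve contrastive self-supervised learning for video representation learning?
RQ2: Why does directly applying temporal augmentations in existing CSL frameworks usually fail to help or degrade performance?
RQ3: Is there a more effective way to integrate temporal knowledge into CSL than using it as simple data augmentation?
RQ4: Are there intrinsic relationships among different video pretext tasks that can be exploited to improve self-supervision?
RQ5: Can combining multiple temporal pretext tasks with contrastive learning in a unified framework yield better performance?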

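Key Findings

Under fine-tuning, TaCo reaches 85.1% top-1 accuracy on UCF-101 and 51.6% on HMDB-51, relative improvements of 3% and 2.4% over the previous state of the art, respectively.
The task combinations 'speed + shuffle' and 'rotation jittering + reverse' perform best, suggesting that specific task pairs are synergistic.
When the contrastive loss is turned off and only the task losses are optimized, performance drops significantly, which shows that contrastive learning plays a key role in TaCo.
The hyperparameter λ that balances the contrastive and task losses works best at λ=10, and performance is stable in the range 10 to 15.
TaCo consistently improves performance across backbones and CSL frameworks, indicating good generalization and robustness.
Even under linear evaluation, TaCo outperforms both the vanilla CSL and temporal-augmentation baselines, confirming that it learns transferable representations.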
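
The findings on λ and on switching off the contrastive loss are easier to follow with the joint objective written out. The sketch below is one plausible form under stated assumptions, not the paper's exact formulation: it uses an InfoNCE-style contrastive term (the kind of loss used by MoCo-like frameworks), a softmax cross-entropy per task head, and it assumes λ multiplies the summed task losses. The review only says λ balances the two terms and does not state which one it scales. The function names and the temperature value are hypothetical.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query embedding.

    The positive is another view of the same clip; the negatives are
    embeddings of other clips. `temperature` is an illustrative value.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    return _log_sum_exp(logits) - logits[0]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one task head (e.g. 'was the clip reversed?')."""
    return _log_sum_exp(logits) - logits[label]


def joint_loss(contrastive, task_losses, lam, use_contrastive=True):
    """Combine the contrastive loss with the task-specific losses.

    ASSUMPTION: lam scales the summed task losses; the source does not
    specify which term it multiplies. Setting use_contrastive=False
    mirrors the ablation that optimizes the task losses alone.
    """
    total = lam * sum(task_losses)
    if use_contrastive:
        total += contrastive
    return total


# Toy usage with 2-D embeddings and two binary task heads.
ctr = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
tasks = [cross_entropy([2.0, 0.5], 0), cross_entropy([0.1, 1.3], 1)]
full = joint_loss(ctr, tasks, lam=10.0)
ablated = joint_loss(ctr, tasks, lam=10.0, use_contrastive=False)
```

Because the task heads and the contrastive branch share one encoder, the weighting decides how much each signal shapes the shared representation, which is why the choice of λ shows up in the reported results.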

This review was generated by AI and reviewed by human editors.