QUICK REVIEW

[Paper Review] Video Representation Learning by Dense Predictive Coding

Tengda Han, Weidi Xie|arXiv (Cornell University)|Sep 10, 2019

Human Pose and Action Recognition | References: 49 | Cited by: 48

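One-Sentence Summary

Dense Predictive Coding (DPC) learns self-supervised spatio-temporal video representations by predicting future embeddings densely and sequentially, uses curriculum training to extend how far into the future it predicts, and achieves strong action recognition performance using RGB frames only.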

ABSTRACT

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; Second, we propose a curriculum training scheme to predict further into the future with progressively less temporal context. This encourages the model to only encode slowly varying spatial-temporal signals, therefore leading to semantic representations; Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101(75.7% top1 acc) and HMDB51(35.7% top1 acc), outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.

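Motivation and Objectives

Motivate self-supervised learning of spatio-temporal video embeddings suitable for action recognition.
Introduce Dense Predictive Coding (DPC), which predicts dense future representations from past context.
Propose a curriculum training scheme that predicts further into the future with progressively less temporal context.
Show that DPC with a single RGB stream achieves state-of-the-art self-supervised results on UCF101 and HMDB51 and approaches an ImageNet-pretrained baseline.
Assess the correlation between self-supervised gains and downstream supervised performance.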

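Proposed Method

Encode video blocks with a 3D-ResNet encoder to obtain z_t.
Aggregate the past latent codes z_t into a context c_t with a ConvGRU.
Predict the future embeddings \hat{z}_{t+1}, \hat{z}_{t+2}, ... with a small predictor.
Train with a dense multi-way Noise Contrastive Estimation (NCE) loss computed across spatial locations and time steps.
Apply frame-wise augmentation to keep the model from relying on optical flow, and use curriculum learning to extend the temporal horizon of future prediction.
Optionally fine-tune the learned representation on the downstream action recognition task.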
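
The dense multi-way contrastive objective listed above can be illustrated with a toy sketch. It assumes a plain dot-product score and a handful of hand-written feature vectors; `dense_nce_loss`, `dot`, and the toy data are hypothetical names for illustration, not the authors' code. Each predicted vector at a (time step, spatial location) is scored against every ground-truth vector in the pool, and the one at the same position is treated as the positive. As a simplification, negatives here come from a single clip only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_nce_loss(predicted, targets):
    """Average multi-way contrastive loss over every (time step, location) key.

    Both arguments map a (time_step, location) key to a feature vector.
    For each key, the target at the same key is the positive and the
    targets at all other keys act as negatives.
    """
    keys = sorted(targets)
    total = 0.0
    for key in keys:
        scores = [dot(predicted[key], targets[other]) for other in keys]
        # Numerically stable log of the softmax denominator.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_denominator - dot(predicted[key], targets[key])
    return total / len(keys)


# Toy data (hypothetical): 2 future steps x 2 spatial locations, 4-d features.
targets = {
    (1, 0): [1.0, 0.0, 0.0, 0.0],
    (1, 1): [0.0, 1.0, 0.0, 0.0],
    (2, 0): [0.0, 0.0, 1.0, 0.0],
    (2, 1): [0.0, 0.0, 0.0, 1.0],
}
# Predictions aligned with their own targets.
aligned = {key: [4.0 * v for v in vec] for key, vec in targets.items()}
# Predictions that point at the wrong position (keys rotated by one).
keys = sorted(targets)
misaligned = {key: aligned[keys[(i + 1) % len(keys)]] for i, key in enumerate(keys)}

aligned_loss = dense_nce_loss(aligned, targets)
misaligned_loss = dense_nce_loss(misaligned, targets)
print(aligned_loss < misaligned_loss)  # prints True
```

In this toy setup the loss is small only when each prediction matches the target at its own time step and location, which is the property that makes the objective dense rather than a single clip-level comparison.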

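Experimental Results

Research Questions

RQ1: Can self-supervised learning of dense spatio-temporal embeddings from RGB video produce representations that are competitive for action recognition?
RQ2: Does predicting the future under a curriculum schedule improve semantic representation learning?
RQ3: How does DPC compare with prior self-supervised methods on standard action recognition benchmarks?
RQ4: Is dense, sequential prediction, as opposed to projecting to a single vector, necessary for learning useful video representations?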

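Key Findings

With an RGB stream only, DPC achieves state-of-the-art self-supervised performance on UCF101 (75.7% top-1 accuracy after Kinetics-400 pretraining) and HMDB51 (35.7% top-1 accuracy), outperforming previous RGB-only methods.
Under the curriculum training scheme, dense, sequential prediction of future spatio-temporal blocks improves the learned representation and downstream action recognition.
Pretraining on the larger Kinetics-400 dataset yields stronger downstream performance than training on UCF101 alone, showing the benefit of scale.
Self-supervised accuracy during DPC pretraining is positively correlated with downstream supervised action recognition accuracy.
Predicting further into the future under curriculum learning improves downstream performance even though self-supervised accuracy on the extended task is lower, which suggests stronger semantic learning.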
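
The curriculum idea in the findings above can be sketched as a schedule that moves blocks from the context side of a fixed-length clip to the prediction side. The stage values used here (8 blocks per clip, starting from 5 context blocks and 3 future blocks) are illustrative assumptions, and `curriculum_stages` and `split_clip` are hypothetical helpers rather than the authors' training code.

```python
def curriculum_stages(total_blocks, first_future, last_future):
    """Return (context_blocks, future_blocks) pairs, one per curriculum stage.

    Each later stage predicts one more future block and therefore sees
    one fewer context block from the same fixed-length clip.
    """
    stages = []
    for future in range(first_future, last_future + 1):
        context = total_blocks - future
        if context < 1:
            raise ValueError("each stage needs at least one context block")
        stages.append((context, future))
    return stages


def split_clip(blocks, context):
    """Split a clip into the observed blocks and the blocks to predict."""
    return blocks[:context], blocks[context:]


# Hypothetical clip of 8 blocks, identified by index for illustration.
clip = list(range(8))
stages = curriculum_stages(total_blocks=8, first_future=3, last_future=4)
for context, future in stages:
    past, to_predict = split_clip(clip, context)
    print(f"context={len(past)} blocks, predict={len(to_predict)} blocks")
```

A model would be trained on the first stage and then continued on the next, so the harder prediction task starts from weights that already solve the easier one.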


This review was generated by AI and reviewed by human editors.