QUICK REVIEW

[Paper Review] Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius, Heng Wang|arXiv (Cornell University)|Feb 9, 2021

Human Pose and Action Recognition | References: 59 | Cited by: 1,309

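ONE-SENTENCE SUMMARY

TimeSformer is a convolution-free video classifier built exclusively on self-attention over space and time; with divided space-time attention it achieves the best reported accuracy on the Kinetics-400 and Kinetics-600 benchmarks.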

ABSTRACT

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.

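RESEARCH MOTIVATION AND OBJECTIVES

Advance convolution-free video modeling by using self-attention for spatiotemporal learning.
Extend the Vision Transformer (ViT) to video by treating frame patches as tokens in a space-time sequence.
Systematically compare self-attention schemes to identify an efficient and accurate design for video classification.

PROPOSED METHOD

Represent a video clip as a sequence of frame-level patches, each embedded as a token with a positional encoding.
Build a Transformer encoder for video using multi-head self-attention over space-time neighborhoods.
Study five space-time attention schemes (Space, Joint Space-Time, Divided Space-Time, Sparse Local Global, Axial) and compare their accuracy and efficiency.
Adopt Divided Space-Time Attention (temporal attention first, then spatial attention) as the preferred scheme for its higher accuracy and better scalability.
Pretrain on ImageNet (1K or 21K) and fine-tune on video datasets; compare against 3D CNN baselines in accuracy and in training and inference cost.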
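
The patch tokenization and the divided (temporal first, then spatial) attention described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head, random weights, and omits the classification token, layer normalization, and MLP that a full Transformer block would include. The helper names (`patchify`, `self_attention`, `divided_space_time_block`), the 16-pixel patch size, and the embedding width of 64 are all assumptions chosen for illustration.

```python
import numpy as np

def patchify(clip, patch):
    """Split a clip of shape (F, H, W, C) into tokens of shape (F, N, patch*patch*C)."""
    F, H, W, C = clip.shape
    assert H % patch == 0 and W % patch == 0, "frame size must be divisible by patch size"
    gh, gw = H // patch, W // patch
    x = clip.reshape(F, gh, patch, gw, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (F, gh, gw, patch, patch, C)
    return x.reshape(F, gh * gw, patch * patch * C)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention along the second-to-last axis of x (..., L, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def divided_space_time_block(tokens, time_w, space_w):
    """tokens: (F, N, D). Temporal attention, then spatial attention, each with a residual."""
    # Temporal step: every spatial location attends across the F frames.
    xt = np.swapaxes(tokens, 0, 1)               # (N, F, D)
    xt = xt + self_attention(xt, *time_w)
    x = np.swapaxes(xt, 0, 1)                    # back to (F, N, D)
    # Spatial step: every frame attends across its own N patches.
    return x + self_attention(x, *space_w)

# Hypothetical toy input: 8 frames of 32x32 RGB, 16x16 patches -> N = 4 tokens per frame.
rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 32, 32, 3))
tokens = patchify(clip, patch=16)                # (8, 4, 768)

D = 64                                           # assumed embedding width
embed = rng.standard_normal((tokens.shape[-1], D)) * 0.02
pos = rng.standard_normal((8, 4, D)) * 0.02      # stands in for a learned positional embedding
x = tokens @ embed + pos

def random_qkv():
    return tuple(rng.standard_normal((D, D)) * 0.02 for _ in range(3))

out = divided_space_time_block(x, random_qkv(), random_qkv())
print(out.shape)
```

The two attention steps use separate weights, so the temporal and spatial projections can be learned independently; swapping the frame and patch axes is what restricts each step to one dimension of the clip.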

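EXPERIMENTAL RESULTS

Research Questions

RQ1: Can self-attention alone, without convolutions, learn spatiotemporal representations that are useful for video understanding?
RQ2: Which space-time attention scheme offers the best trade-off between accuracy and computational efficiency?
RQ3: How does TimeSformer compare with 3D CNNs on standard benchmarks such as Kinetics-400/600 and Something-Something-V2?
RQ4: How do the scale of the pretraining data (ImageNet-1K vs. ImageNet-21K) and the input length and resolution affect TimeSformer's performance?
RQ5: Can TimeSformer model long-range video more efficiently than conventional CNN-based methods?

Key Findings

Among the schemes tested, Divided Space-Time Attention achieves the highest accuracy on both Kinetics-400 and Something-Something-V2.
With divided attention, TimeSformer is more accurate and scales better than with joint space-time attention, especially as spatial resolution and clip length increase.
TimeSformer reaches competitive or state-of-the-art results on Kinetics-400/600 while offering lower inference cost and faster training than comparable 3D CNNs.
Pretraining on ImageNet-21K generally improves results on K400, while SSv2 benefits to a similar degree from ImageNet-1K and ImageNet-21K pretraining.
Because it processes video as a sequence of patches, TimeSformer can handle longer input clips (up to 96 frames) and trains scalably, typically more efficiently than 3D CNNs.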
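
To make the scalability finding concrete, the sketch below counts how many keys each query token is compared against under joint and divided attention. It assumes N patches per frame, F frames, and one classification token included in every attention step, which gives N*F + 1 comparisons for joint attention and N + F + 2 for divided attention. The function name `comparisons_per_token`, the 16-pixel patch size, and the clip settings in the loop are illustrative assumptions, not values taken from this summary (apart from the 96-frame clip length).

```python
def comparisons_per_token(frames, height, width, patch=16):
    """Return (N, joint, divided) key counts per query token.

    Assumes non-overlapping square patches and one classification token
    that takes part in each attention step.
    """
    n = (height // patch) * (width // patch)   # patches per frame
    joint = n * frames + 1                     # all patches in all frames + cls token
    divided = n + frames + 2                   # (F + 1) temporal keys + (N + 1) spatial keys
    return n, joint, divided

# Illustrative clip settings: (frames, square frame size in pixels).
for frames, size in [(8, 224), (16, 448), (96, 224)]:
    n, joint, divided = comparisons_per_token(frames, size, size)
    print(f"F={frames:3d} size={size} N={n:4d} joint={joint:6d} divided={divided:4d}")
```

Under these assumptions the joint cost grows with the product of clip length and patch count, while the divided cost grows with their sum, which is consistent with divided attention remaining practical at higher resolutions and longer clips.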

This review was generated by AI and reviewed by human editors.