[Paper Review] Is Space-Time Attention All You Need for Video Understanding?
TimeSformer is a convolution-free video classifier built only on self-attention over space and time. Among the attention schemes compared, divided space-time attention gives the best accuracy, and the model reports state-of-the-art results on Kinetics-400 and Kinetics-600.
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
Motivation & Objective
- Test whether self-attention alone, without convolutions, can learn spatiotemporal features for video.
- Extend Vision Transformer (ViT) to video by treating frame patches as tokens in a space-time sequence.
- Systematically compare self-attention schemes to identify efficient and accurate designs for video classification.
Proposed method
- Represent a video clip as a sequence of frame-level patches embedded into tokens with positional encoding.
- Build a Transformer encoder for video whose multi-head self-attention operates across both space and time.
- Investigate five space-time attention schemes (Space, Joint Space-Time, Divided Space-Time, Sparse Local Global, Axial) and compare performance and efficiency.
- Adopt a Divided Space-Time Attention design (temporal then spatial) as the preferred scheme for better accuracy and scalability.
- Pretrain on ImageNet (1K or 21K) and fine-tune on video datasets; compare against 3D CNN baselines in terms of accuracy and training/inference cost.
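The divided scheme listed above (temporal attention first, then spatial attention, each with a residual connection) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single head with random weights and leaves out layer normalization, the MLP, and the classification token. The names `attention`, `divided_space_time_block`, and `params` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over the second-to-last axis.
    # x: (..., L, D); every leading axis is treated as an independent batch.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def divided_space_time_block(tokens, params):
    # tokens: (F, N, D) = frames, patches per frame, embedding size.
    # Temporal step: each patch location attends to the same location
    # in every frame, so the attended axis has length F.
    xt = np.swapaxes(tokens, 0, 1)            # (N, F, D)
    xt = xt + attention(xt, *params["time"])  # residual connection
    x = np.swapaxes(xt, 0, 1)                 # back to (F, N, D)
    # Spatial step: each patch attends to all patches of its own frame,
    # so the attended axis has length N.
    x = x + attention(x, *params["space"])    # residual connection
    return x

# Toy sizes chosen for illustration only.
F, N, D = 4, 9, 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(F, N, D))
params = {
    "time": tuple(rng.normal(scale=0.1, size=(D, D)) for _ in range(3)),
    "space": tuple(rng.normal(scale=0.1, size=(D, D)) for _ in range(3)),
}
out = divided_space_time_block(tokens, params)
print(out.shape)
```

Neither step ever forms an attention matrix over all F·N tokens at once, which is why this design scales to longer clips and higher resolutions than joint space-time attention.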
Experimental results
Research questions
- RQ1: Can self-attention alone, without convolution, learn effective spatiotemporal representations for video understanding?
- RQ2: Which space-time attention scheme offers the best trade-off between accuracy and computational efficiency for video classification?
- RQ3: How does TimeSformer perform relative to 3D CNNs on standard benchmarks like Kinetics-400/600 and Something-Something-V2?
- RQ4: What is the impact of pretraining data scale (ImageNet-1K vs ImageNet-21K) and input length/resolution on TimeSformer performance?
- RQ5: Is TimeSformer capable of efficient long-range video modeling compared to traditional CNN-based approaches?
Key findings
- Divided Space-Time Attention achieves the best accuracy on Kinetics-400 and Something-Something-V2 among the schemes tested.
- With divided attention, TimeSformer is more accurate than with joint space-time attention and scales better, particularly as spatial resolution and clip length grow.
- TimeSformer attains competitive or state-of-the-art results on Kinetics-400/600 while offering lower inference costs and faster training than comparable 3D CNNs.
- Pretraining on ImageNet-21K generally improves Kinetics-400 results, whereas on Something-Something-V2 ImageNet-1K and ImageNet-21K pretraining give similar accuracy.
- TimeSformer enables longer input clips (up to 96 frames) and scalable training by treating video as a sequence of patches, often outperforming 3D CNNs in training efficiency.
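The scalability finding can be made concrete by counting how many keys each query token is compared against. With N patches per frame and F frames, joint space-time attention compares each patch with N·F + 1 tokens (all patches in all frames plus the classification token), while divided attention needs N + F + 2 (F + 1 in the temporal step and N + 1 in the spatial step). The sketch below assumes 224×224 frames and 16×16 patches, which the review does not state; the function names are hypothetical.

```python
def patches_per_frame(height, width, patch):
    # Number of non-overlapping patch x patch tiles in one frame.
    return (height // patch) * (width // patch)

def comparisons_per_query(n_patches, n_frames):
    # Joint: every patch in every frame, plus the classification token.
    joint = n_patches * n_frames + 1
    # Divided: (frames + cls) in the temporal step,
    # then (patches + cls) in the spatial step.
    divided = (n_frames + 1) + (n_patches + 1)
    return joint, divided

# Assumed input geometry: 224x224 frames cut into 16x16 patches.
n = patches_per_frame(224, 224, 16)
for frames in (8, 96):
    joint, divided = comparisons_per_query(n, frames)
    print(f"{frames:>2} frames: joint={joint}, divided={divided}")
# prints:
#  8 frames: joint=1569, divided=206
# 96 frames: joint=18817, divided=294
```

Going from 8 to 96 frames multiplies the joint count by roughly twelve, while the divided count grows by less than half, which is consistent with the review's point that divided attention is what makes 96-frame clips practical.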
This review was created by AI and reviewed by human editors.