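[Paper Digest] An Image is Worth 16x16 Words, What is a Video Worth?
This paper proposes STAM, a fully Transformer-based model for video action recognition. It applies spatial and temporal self-attention to a sparse set of uniformly sampled frames to recognize human actions, and reaches accuracy close to the existing state of the art while using significantly fewer frames and much less inference time.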
Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach $80.5$ top-1 accuracy with $\times 30$ less frames per video, and $\times 40$ faster inference than the current leading method. Code is available at: https://github.com/Alibaba-MIIL/STAM
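To make the idea of sparse, uniform sampling concrete, here is a minimal sketch. It is not the authors' code: the helper name `uniform_frame_indices` and the choice of taking the center frame of each equal-length segment are assumptions for illustration, and the linked repository may select frames differently.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over the whole video.

    Hypothetical helper: splits the video into num_samples equal-length
    segments and takes the frame at the center of each segment.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    # Clamp to the last valid index to guard against rounding at the end.
    return [min(total_frames - 1, int(segment * (i + 0.5)))
            for i in range(num_samples)]


# Example: a 10-second video at 30 fps (300 frames) reduced to 16 frames.
indices = uniform_frame_indices(300, 16)
print(indices)
```

Unlike clip-based sampling, which draws several dense groups of neighboring frames, a single pass over indices like these already spans the full duration of the video.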
Research Motivation and Objectives
- Motivate efficient video action recognition amid rising video data volumes.
- Develop a fully transformer-based approach that models spatio-temporal information without 3D convolutions.
- Reduce input frame requirements while maintaining or surpassing state-of-the-art accuracy.
- Demonstrate end-to-end trainability and practical inference benefits over clip-based 3D CNN methods.
Proposed Method
- Extend Vision Transformer concepts to video by treating frames as sentences and patches as words.
- Propose Space Time Attention Model (STAM) with separate spatial and temporal transformers.
- Compute frame-level embeddings via spatial attention over patches within each frame, then model temporal dependencies across frames with a temporal transformer.
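- Use a classification token per frame to obtain frame-level embeddings, and a video-level classification token in the temporal transformer to produce the final prediction.
- Keep complexity manageable by disentangling spatial and temporal attention: the cost is O(FN^2 + F^2), where F is the number of frames and N is the number of patches per frame.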
- Train temporal transformer components from scratch while leveraging pretrained spatial backbones (ViT-B/ViT variants).
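The complexity figure in the list above can be checked with a short counting sketch. It counts query-key pairs for a single attention layer only, ignoring classification tokens, attention heads, and constant factors. The function names are hypothetical, and N = 196 is an assumption corresponding to a 224x224 frame split into 16x16 patches.

```python
def joint_attention_pairs(num_frames: int, patches_per_frame: int) -> int:
    """Query-key pairs when every patch of every frame attends to all others."""
    tokens = num_frames * patches_per_frame
    return tokens * tokens  # (F * N)^2


def factorized_attention_pairs(num_frames: int, patches_per_frame: int) -> int:
    """Query-key pairs for per-frame spatial attention plus temporal attention."""
    spatial = num_frames * patches_per_frame ** 2  # F * N^2
    temporal = num_frames ** 2                     # F^2
    return spatial + temporal


# Assumed setting: 16 frames, 196 patches per frame (224x224 input, 16x16 patches).
F, N = 16, 196
joint = joint_attention_pairs(F, N)
factorized = factorized_attention_pairs(F, N)
print(f"joint: {joint:,}  factorized: {factorized:,}  ratio: {joint / factorized:.1f}")
```

Under these assumptions the spatial term dominates, so the saving over joint attention is roughly a factor of F.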
Experimental Results
Research Questions
- RQ1: Can a fully transformer-based model capture long-range spatio-temporal dependencies in videos with sparse frame sampling?
- RQ2: Does separating spatial and temporal attention improve efficiency and accuracy compared to joint spatio-temporal attention?
- RQ3: How does STAM perform versus state-of-the-art 3D CNNs when using significantly fewer frames?
- RQ4: What are the trade-offs between frame count, accuracy, and inference speed on benchmarks like Kinetics-400?
Key Findings
- STAM achieves accuracy competitive with the state of the art while using far fewer frames (e.g., 16 frames) and running substantially faster at inference.
- On Kinetics-400, STAM with 16 frames attains 79.3% top-1 accuracy at 270 GFLOPs, and with 64 frames reaches 80.5% with 1080 GFLOPs.
- Compared to X3D-L, STAM yields higher accuracy (79.3% vs 77.5%) with much lower computation (270 GFLOPs vs 744 GFLOPs) and dramatically faster inference (0.05 hrs vs 2.27 hrs on the validation set).
- STAM with 16 frames processes about 43 times more videos per second (VPS) than X3D-L on a single GPU.
- Temporal attention provides a meaningful accuracy boost over spatial-only models, and using different backbones consistently shows gains with the temporal transformer.
- Increasing the frame count from 16 to 32 or 64 yields modest accuracy gains (~0.6% per doubling), so accuracy does not scale linearly with the number of frames.
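The trade-offs in these findings can be made explicit with a little arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the derived ratios are approximate because the inputs are rounded.

```python
# Figures as quoted in the findings above (Kinetics-400).
stam = {
    16: {"top1": 79.3, "gflops": 270},
    64: {"top1": 80.5, "gflops": 1080},
}
x3d_l = {"top1": 77.5, "gflops": 744, "val_hours": 2.27}
stam16_val_hours = 0.05

# Compute per sampled frame: identical for 16 and 64 frames, i.e. cost grows
# linearly with the frame count.
gflops_per_frame = {frames: row["gflops"] / frames for frames, row in stam.items()}

# X3D-L relative to STAM-16: total compute and validation-set runtime.
compute_ratio = x3d_l["gflops"] / stam[16]["gflops"]
runtime_ratio = x3d_l["val_hours"] / stam16_val_hours

# Average top-1 gain per doubling of frames (16 -> 32 -> 64 is two doublings).
doublings = 2
gain_per_doubling = (stam[64]["top1"] - stam[16]["top1"]) / doublings

print(gflops_per_frame)
print(f"compute ratio: {compute_ratio:.2f}x, runtime ratio: {runtime_ratio:.1f}x")
print(f"top-1 gain per doubling: {gain_per_doubling:.2f} points")
```

Going from 16 to 64 frames multiplies compute by four for about 1.2 points of top-1 accuracy. The runtime ratio computed from the rounded hours is in the same range as the reported factor of 43 in VPS, though an exact match should not be expected.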
This digest was generated by AI and reviewed by human editors.