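[Paper Review] Weakly Supervised Action Localization by Sparse Temporal Pooling Network
The paper proposes the Sparse Temporal Pooling Network (STPN), a weakly supervised method that uses video-level labels and a sparsity-driven attention mechanism to localize actions in untrimmed videos, generating temporal proposals from Temporal Class Activation Maps (T-CAMs).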
We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision.
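The two-term objective described in the abstract can be sketched as follows. The notation is an assumption made here for illustration: $\lambda_t$ is the attention weight of segment $t$, $T$ is the number of segments, and $\beta$ is a balancing constant. The L1 penalty is a common way to encourage sparsity and is assumed here as the form of the sparsity term.

```latex
\mathcal{L} = \mathcal{L}_{\text{class}} + \beta \, \mathcal{L}_{\text{sparsity}},
\qquad
\mathcal{L}_{\text{sparsity}} = \lVert \boldsymbol{\lambda} \rVert_1 = \sum_{t=1}^{T} \lvert \lambda_t \rvert
```

Under this reading, $\mathcal{L}_{\text{class}}$ is the video-level classification error, and a larger $\beta$ pushes more attention weights toward zero, so fewer segments contribute to the pooled video representation.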
Motivation and Objectives
- Learn to localize actions in untrimmed videos using only video-level labels, with no temporal annotations.
- Develop a network that selects a sparse subset of key video segments for action recognition.
- Fuse class-agnostic attentions with temporal class activations to propose action intervals.
Proposed Method
- Two-stream I3D feature extractors (RGB and flow) pretrained on Kinetics are used to represent video segments.
- An attention module produces segment-level weights; a sparsity loss enforces a sparse selection of segments.
- Video-level classification is performed via attention-weighted temporal pooling of segment features.
- Temporal Class Activation Maps (T-CAMs) for each class are computed to form one-dimensional temporal proposals.
- Attention-weighted T-CAMs from the RGB and flow streams are combined with a fusion parameter alpha to score proposals.
- Non-maximum suppression is applied to temporal proposals per class.
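The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the threshold, the default `alpha`, and the IoU cutoff are hypothetical, the classifier is reduced to a single linear layer, and the two modalities are blended before thresholding for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_pool(features, attention):
    """Attention-weighted temporal pooling: (T, D) features, (T,) weights -> (D,)."""
    return attention @ features

def sparsity_loss(attention):
    """L1 norm of the attention weights (assumed form of the sparsity term)."""
    return float(np.abs(attention).sum())

def tcam(features, class_weights):
    """Per-segment class activations: (T, D) x (C, D) -> (T, C)."""
    return features @ class_weights.T

def fuse_weighted_tcam(att_rgb, tcam_rgb, att_flow, tcam_flow, alpha=0.5):
    """Blend attention-weighted, sigmoid-squashed T-CAMs of the two streams."""
    w_rgb = att_rgb[:, None] * sigmoid(tcam_rgb)
    w_flow = att_flow[:, None] * sigmoid(tcam_flow)
    return alpha * w_rgb + (1.0 - alpha) * w_flow

def proposals_1d(signal, threshold):
    """Maximal runs with signal > threshold as (start, end_inclusive, mean_score)."""
    out, start = [], None
    for t, v in enumerate(signal):
        if v > threshold and start is None:
            start = t
        elif v <= threshold and start is not None:
            out.append((start, t - 1, float(signal[start:t].mean())))
            start = None
    if start is not None:
        out.append((start, len(signal) - 1, float(signal[start:].mean())))
    return out

def temporal_nms(props, iou_threshold=0.5):
    """Greedy 1D non-maximum suppression over (start, end_inclusive, score)."""
    keep = []
    for p in sorted(props, key=lambda q: q[2], reverse=True):
        suppressed = False
        for k in keep:
            inter = max(min(p[1], k[1]) - max(p[0], k[0]) + 1, 0)
            union = (p[1] - p[0] + 1) + (k[1] - k[0] + 1) - inter
            if inter / union > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            keep.append(p)
    return keep
```

For one class `c`, a typical call sequence would be `fused = fuse_weighted_tcam(...)`, then `temporal_nms(proposals_1d(fused[:, c], threshold))`, with segment indices converted back to timestamps using the segment length.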
Experimental Results
Research Questions
- RQ1: Can actions in untrimmed videos be accurately localized using only video-level labels?
- RQ2: Does enforcing sparsity in segment selection improve weakly supervised action localization?
- RQ3: How effective are Temporal Class Activation Maps (T-CAMs) combined with class-agnostic attention for proposing action intervals?
- RQ4: What is the impact of using RGB, flow, or their combination for proposal scoring?
Key Findings
- STPN achieves state-of-the-art results among weakly supervised methods on THUMOS14.
- On THUMOS14, STPN with UntrimmedNet features outperforms prior weakly supervised approaches.
- On ActivityNet1.3, STPN shows competitive weakly supervised performance and surpasses some fully supervised baselines in certain settings.
- Ablation studies show that both the attention mechanism and the sparsity loss substantially improve performance.
- Two-stream (RGB+flow) features outperform single-modality results, with flow contributing stronger cues for localization.
This review was generated by AI and reviewed by a human editor.