QUICK REVIEW

[Paper Digest] Weakly Supervised Action Localization by Sparse Temporal Pooling Network

Phuc Nguyen, Ting Liu | arXiv (Cornell University) | Dec 14, 2017
Human Pose and Action Recognition | References: 46 | Cited by: 53
One-Sentence Summary

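The paper proposes the Sparse Temporal Pooling Network (STPN), a weakly supervised method that uses video-level labels and a sparsity-driven attention mechanism to localize actions in untrimmed videos, generating temporal proposals from Temporal Class Activation Maps (T-CAMs).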

ABSTRACT

We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision.
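The two-term loss mentioned in the abstract can be written compactly. The notation below is an assumption made for illustration: $\lambda_t$ is the attention weight of segment $t$ out of $T$, $\beta$ is a trade-off constant, and the classification term is taken to be a standard multi-label cross-entropy over $C$ classes with video-level labels $y_c$ and predicted probabilities $\hat{y}_c$.

```latex
\mathcal{L} = \mathcal{L}_{\text{class}} + \beta \, \mathcal{L}_{\text{sparsity}},
\qquad
\mathcal{L}_{\text{sparsity}} = \lVert \boldsymbol{\lambda} \rVert_1 = \sum_{t=1}^{T} \lvert \lambda_t \rvert,
\qquad
\mathcal{L}_{\text{class}} = -\frac{1}{C} \sum_{c=1}^{C} \Big[ y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \Big]
```

The $\ell_1$ term pushes most attention weights toward zero, so only a small set of segments contributes to the video-level prediction.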

Motivation and Objectives

  • Learn to localize actions in untrimmed videos from video-level labels alone, without temporal localization annotations.
  • Develop a network that selects a sparse subset of key video segments for action recognition.
  • Fuse class-agnostic attentions with temporal class activations to propose action intervals.

Proposed Method

  • Two-stream I3D feature extractors (RGB and flow) pretrained on Kinetics are used to represent video segments.
  • An attention module produces segment-level weights; a sparsity loss enforces a sparse selection of segments.
  • Video-level classification is performed via attention-weighted temporal pooling of segment features.
  • Temporal Class Activation Maps (T-CAMs) for each class are computed to form one-dimensional temporal proposals.
  • Weighted T-CAMs combine RGB and flow with a fusion parameter alpha to score proposals.
  • Non-maximum suppression is applied to temporal proposals per class.
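The steps listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, the thresholding rule for turning a one-dimensional signal into proposals, and the IoU-based suppression criterion are all hypothetical choices made for the example. Segment features are assumed to be precomputed (for instance by the I3D streams), and attention weights are assumed to lie in [0, 1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_pool(features, attention):
    """Attention-weighted temporal pooling: (T, D) features, (T,) weights -> (D,)."""
    return (attention[:, None] * features).sum(axis=0)

def weighted_tcam(features, attention, class_weights):
    """Per-segment class activations (T, C), gated by class-agnostic attention.

    class_weights is the (C, D) weight matrix of the video-level classifier.
    """
    tcam = features @ class_weights.T              # (T, C) raw T-CAM
    return attention[:, None] * sigmoid(tcam)      # attention-weighted T-CAM

def extract_proposals(signal, threshold):
    """Inclusive (start, end) index pairs of consecutive runs where signal > threshold."""
    proposals, start = [], None
    for t, flag in enumerate(signal > threshold):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(signal) - 1))
    return proposals

def score_proposal(wtcam_rgb, wtcam_flow, start, end, alpha):
    """Mean of the alpha-fused RGB/flow weighted T-CAM over one proposal (one class)."""
    fused = alpha * wtcam_rgb[start:end + 1] + (1 - alpha) * wtcam_flow[start:end + 1]
    return float(fused.mean())

def temporal_nms(proposals, scores, iou_threshold):
    """Greedy 1-D non-maximum suppression; returns indices of kept proposals."""
    keep = []
    for i in np.argsort(scores)[::-1]:             # highest score first
        s_i, e_i = proposals[i]
        suppressed = False
        for j in keep:
            s_j, e_j = proposals[j]
            inter = max(0, min(e_i, e_j) - max(s_i, s_j) + 1)
            union = (e_i - s_i + 1) + (e_j - s_j + 1) - inter
            if inter / union > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            keep.append(int(i))
    return keep
```

In this sketch, proposals would be extracted per class and per stream from one column of the weighted T-CAM, scored with the fused signal, and then filtered with `temporal_nms` class by class.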

Experimental Results

Research Questions

  • RQ1: Can actions in untrimmed videos be accurately localized using only video-level labels?
  • RQ2: Does enforcing sparsity in segment selection improve weakly supervised action localization?
  • RQ3: How effective are Temporal Class Activation Maps (T-CAMs) combined with class-agnostic attention for proposing action intervals?
  • RQ4: What is the impact of using RGB, flow, or their combination for proposal scoring?

Key Findings

  • STPN achieves state-of-the-art results among weakly supervised methods on THUMOS14.
  • On THUMOS14, STPN with UntrimmedNet features outperforms prior weakly supervised approaches.
  • On ActivityNet1.3, STPN shows competitive weakly supervised performance and surpasses some fully supervised baselines in certain settings.
  • Ablation studies show that both the attention mechanism and the sparsity loss substantially improve performance.
  • Two-stream (RGB+flow) features outperform single-modality results, with flow contributing stronger cues for localization.

This digest was generated by AI and reviewed by human editors.