QUICK REVIEW

[Paper Review] Weakly Supervised Action Localization by Sparse Temporal Pooling Network

Phuc Nguyen, Ting Liu | arXiv (Cornell University) | Dec 14, 2017

Human Pose and Action Recognition | References: 46 | Cited by: 53

One-Sentence Summary

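The paper proposes the Sparse Temporal Pooling Network (STPN), a weakly supervised method that uses video-level labels and a sparsity-driven attention mechanism to localize actions in untrimmed videos, generating temporal proposals from Temporal Class Activation Maps (T-CAMs).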

ABSTRACT

We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision.
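
The two-term loss described in the abstract can be written compactly. This is a sketch under assumed notation, not a formula quoted from this page: \lambda_t is taken to be the attention weight of segment t among T segments, \mathcal{L}_{\text{class}} the video-level classification loss, and \beta a hypothetical constant that balances the two terms. The sparsity term is written here as an L1 norm, which is a common way to enforce sparse selection.

```latex
\mathcal{L} = \mathcal{L}_{\text{class}} + \beta \, \mathcal{L}_{\text{sparsity}},
\qquad
\mathcal{L}_{\text{sparsity}} = \lVert \boldsymbol{\lambda} \rVert_1 = \sum_{t=1}^{T} \lvert \lambda_t \rvert
```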

Motivation and Objectives

Learn to localize actions in untrimmed videos using only video-level labels, with no temporal annotations.
Develop a network that selects a sparse subset of key video segments for action recognition.
Fuse class-agnostic attentions with temporal class activations to propose action intervals.

Proposed Method

Two-stream I3D feature extractors (RGB and flow) pretrained on Kinetics are used to represent video segments.
An attention module produces segment-level weights; a sparsity loss enforces a sparse selection of segments.
Video-level classification is performed via attention-weighted temporal pooling of segment features.
Temporal Class Activation Maps (T-CAMs) for each class are computed to form one-dimensional temporal proposals.
Weighted T-CAMs combine RGB and flow with a fusion parameter alpha to score proposals.
Non-maximum suppression is applied to temporal proposals per class.
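
The steps listed above can be sketched in NumPy to make the data flow concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: segment features and attention weights are given as plain arrays rather than produced by I3D and a learned attention module, the sparsity term is an L1 norm, proposals are taken as runs of segments above a threshold, and the proposal score averages attention times the alpha-fused T-CAM over the interval. All function names, the threshold, and the default `alpha` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stpn_forward(features, attention, class_weights):
    """features: (T, D) segments; attention: (T,) in [0, 1]; class_weights: (C, D)."""
    pooled = attention @ features             # (D,) attention-weighted temporal pooling
    video_logits = class_weights @ pooled     # (C,) video-level class scores
    sparsity_loss = np.abs(attention).sum()   # assumed L1 sparsity term
    tcam = features @ class_weights.T         # (T, C) per-segment class activations
    weighted_tcam = attention[:, None] * sigmoid(tcam)
    return video_logits, sparsity_loss, tcam, weighted_tcam

def extract_proposals(signal, threshold):
    """Return inclusive (start, end) index runs where signal > threshold."""
    proposals, start = [], None
    for t, flag in enumerate(signal > threshold):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(signal) - 1))
    return proposals

def score_proposal(start, end, attention, tcam_rgb, tcam_flow, alpha=0.5):
    """Assumed scoring: mean over the interval of attention * fused T-CAM."""
    window = slice(start, end + 1)
    fused = alpha * tcam_rgb[window] + (1.0 - alpha) * tcam_flow[window]
    return float((attention[window] * fused).mean())

def temporal_iou(a, b):
    """IoU of two inclusive index intervals (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def temporal_nms(proposals, scores, iou_threshold=0.5):
    """Greedy per-class NMS: keep the best proposal, drop heavy overlaps."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(proposals[i], proposals[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return [proposals[i] for i in kept]

# Toy input: 6 segments, 2-dim features, 1 class (values are made up).
features = np.array([[0, 0], [2, 0], [2, 0], [0, 0], [3, 0], [0, 0]], dtype=float)
attention = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
class_weights = np.array([[1.0, 0.0]])

logits, sparsity, tcam, weighted = stpn_forward(features, attention, class_weights)
proposals = extract_proposals(weighted[:, 0], threshold=0.5)
scores = [score_proposal(s, e, attention, tcam[:, 0], tcam[:, 0]) for s, e in proposals]
print(proposals, scores, temporal_nms(proposals, scores))
```

In the toy input the same T-CAM stands in for both the RGB and flow streams; with real features each stream would supply its own activations and `alpha` would control their relative weight.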

Experimental Results

Research Questions

RQ1: Can actions in untrimmed videos be accurately localized using only video-level labels?
RQ2: Does enforcing sparsity in segment selection improve weakly supervised action localization?
RQ3: How effective are Temporal Class Activation Maps (T-CAMs) combined with class-agnostic attention for proposing action intervals?
RQ4: What is the impact of using RGB, flow, or their combination for proposal scoring?

Key Findings

STPN achieves state-of-the-art results among weakly supervised methods on THUMOS14.
On THUMOS14, STPN with UntrimmedNet features outperforms prior weakly supervised approaches.
On ActivityNet1.3, STPN shows competitive weakly supervised performance and surpasses some fully supervised baselines in certain settings.
Ablation studies show that both the attention mechanism and the sparsity loss substantially improve performance.
Two-stream (RGB+flow) features outperform single-modality results, with flow contributing stronger cues for localization.

This review was generated by AI and reviewed by human editors.