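[Paper Walkthrough] Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017
This paper proposes Prop-SSAD, a temporal action proposal model based on temporal convolution and an anchor mechanism. Combined with TAG proposals and boundary refinement, it achieves state-of-the-art results on the ActivityNet 2017 proposal and localization tasks.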
In this notebook paper, we describe our approach in the submission to the temporal action proposal (task 3) and temporal action localization (task 4) of ActivityNet Challenge hosted at CVPR 2017. Since the accuracy in action classification task is already very high (nearly 90% in ActivityNet dataset), we believe that the main bottleneck for temporal action localization is the quality of action proposals. Therefore, we mainly focus on the temporal action proposal task and propose a new proposal model based on temporal convolutional network. Our approach achieves the state-of-the-art performances on both temporal action proposal task and temporal action localization task.
Research Motivation and Goals
- Motivate that proposal quality is the main bottleneck for temporal action localization on untrimmed videos.
- Introduce a proposal model based on temporal convolution with anchor mechanism (Prop-SSAD).
- Augment proposals with Temporal Actionness Grouping (TAG) and refine boundaries to improve recall at high IoU.
- Demonstrate state-of-the-art performance on both temporal action proposal and localization tasks in ActivityNet 2017.
- Show that using video-level classification results can yield competitive temporal action localization results.
Proposed Method
- Extract snippet-level two-stream features (appearance and motion) from untrimmed videos and resize to length 256.
- Use Prop-SSAD, a temporal anchor-based detector using multiple temporal feature maps (seven maps with lengths 1,2,4,8,16,32,64) to predict action presence and boundaries.
- Train with overlap loss for proposals and use anchor-based predictions to form proposals without initial classification.
- Implement Temporal Actionness Grouping (TAG) with an MLP to produce actionness scores and generate additional proposals.
- Refine Prop-SSAD boundaries by replacing with TAG proposals that have maximum IoU > 0.75, yielding refined proposals.
- For localization, assign video-level action categories to proposals using video-level classification results and evaluate using standard mAP across IoU thresholds.
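The boundary-refinement step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes proposals are plain `(start, end)` pairs, and the names `temporal_iou` and `refine_boundaries` are hypothetical. Only the rule itself (swap in the TAG proposal with the highest IoU when that IoU exceeds 0.75) comes from the summary.

```python
from typing import List, Tuple

# Assumption: a proposal is a (start, end) pair in any consistent time unit.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def refine_boundaries(
    ssad: List[Segment],
    tag: List[Segment],
    threshold: float = 0.75,
) -> List[Segment]:
    """Replace each Prop-SSAD proposal with its best-matching TAG proposal
    when their temporal IoU exceeds `threshold`; otherwise keep it as is."""
    refined = []
    for proposal in ssad:
        # TAG proposal with the highest overlap, or None if there are none.
        best = max(tag, key=lambda t: temporal_iou(proposal, t), default=None)
        if best is not None and temporal_iou(proposal, best) > threshold:
            refined.append(best)
        else:
            refined.append(proposal)
    return refined


if __name__ == "__main__":
    ssad_proposals = [(10.0, 20.0), (40.0, 50.0)]
    tag_proposals = [(10.5, 20.5), (30.0, 45.0)]
    print(refine_boundaries(ssad_proposals, tag_proposals))
```

In this toy input, the first proposal overlaps a TAG proposal with IoU of about 0.90 and takes its boundaries, while the second has a best overlap of only 0.25 and is left unchanged.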
Experimental Results
Research Questions
- RQ1: Can a temporal convolutional, anchor-based framework (Prop-SSAD) generate high-quality temporal action proposals without external data?
- RQ2: Does combining Prop-SSAD proposals with TAG proposals improve proposal recall, especially at higher IoU thresholds?
- RQ3: To what extent do refined boundaries influence temporal action localization performance on ActivityNet 2017?
- RQ4: Is end-to-end training feasible and beneficial for the proposed framework within ActivityNet 2017?
Key Findings
- Prop-SSAD outperforms the baseline and TAG refinement improves recall, especially at higher IoU thresholds.
- Refined Prop-SSAD achieves higher AR-AN scores than Prop-SSAD (e.g., AR-AN improved from 61.52 to 64.40 in Table 1).
- Proposals refined by TAG lead to better localization results when combined with video-level classification results for action categories.
- On the validation set, their localization results show competitive or superior mAP across IoU thresholds compared to prior methods, and their testing-set average mAP is notably higher (32.26) than several baselines.
- Using first N proposals for localization indicates that localization mAP benefits from higher-quality early proposals (e.g., Ours@1–Ours@100 show progressive gains).
- The study concludes that anchor mechanisms and temporal convolution are effective for temporal action proposal tasks.
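To make the recall-oriented findings above more concrete, here is a rough sketch of how recall over the first N proposals might be measured at one or several temporal IoU thresholds. It is not the official ActivityNet evaluation code. The function names are hypothetical, proposals are assumed to be already sorted by confidence, and the default threshold range of 0.5 to 0.95 in steps of 0.05 is an assumption rather than something stated in this summary.

```python
from typing import List, Sequence, Tuple

# Assumption: a segment is a (start, end) pair in any consistent time unit.
Segment = Tuple[float, float]

# Assumed threshold grid: 0.50, 0.55, ..., 0.95.
DEFAULT_TIOUS = tuple(round(0.5 + 0.05 * i, 2) for i in range(10))


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(gt: List[Segment], proposals: List[Segment], n: int, tiou: float) -> float:
    """Fraction of ground-truth segments matched by any of the first n proposals."""
    if not gt:
        return 0.0
    top = proposals[:n]
    hits = sum(1 for g in gt if any(temporal_iou(g, p) >= tiou for p in top))
    return hits / len(gt)


def average_recall(
    gt: List[Segment],
    proposals: List[Segment],
    n: int,
    tious: Sequence[float] = DEFAULT_TIOUS,
) -> float:
    """Mean of recall_at_n over a set of temporal IoU thresholds."""
    return sum(recall_at_n(gt, proposals, n, t) for t in tious) / len(tious)


if __name__ == "__main__":
    ground_truth = [(0.0, 100.0), (200.0, 300.0)]
    ranked = [(0.0, 72.0), (500.0, 600.0), (200.0, 300.0)]
    for n in (1, 3):
        print(n, average_recall(ground_truth, ranked, n))
```

A proposal with loose boundaries (IoU 0.72 in the toy data) still counts at low thresholds but stops counting at 0.75 and above, which is why tighter boundaries tend to matter most for recall at high IoU.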
This walkthrough was generated by AI and reviewed by human editors.