QUICK REVIEW

[Paper Review] LPAT: Learning to Predict Adaptive Threshold for Weakly-supervised Temporal Action Localization

Xudong Lin, Zheng Shou|arXiv (Cornell University)|Oct 25, 2019

Human Pose and Action Recognition | References: 29 | Cited by: 3

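One-Sentence Summary

This paper proposes LPAT, a method for weakly-supervised temporal action localization that predicts an adaptive threshold for each snippet, set to that snippet's background score, which removes the need for manual threshold tuning. With a novel constraint loss that jointly optimizes localization and classification, LPAT achieves state-of-the-art performance on THUMOS'14 and ActivityNet1.2 using only video-level supervision.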

ABSTRACT

Recently, Weakly-supervised Temporal Action Localization (WTAL) has been densely studied because it can free us from costly annotating temporal boundaries of actions. One prevalent strategy is obtaining action score sequences over time and then truncating segments of scores higher than a fixed threshold at every kept snippet. However, the threshold is not modeled in the training process and manually setting the threshold introduces expert knowledge, which damages the coherence of systems and makes it unfair for comparisons. In this paper, we propose to adaptively set the threshold at each snippet to be its background score, which can be learned to predict (LPAT). In both training and testing time, the predicted threshold is leveraged to localize action segments and the scores of these segments are allocated for video classification. We also identify an important constraint to improve the confidence of generated proposals, and model it as a novel loss term, which facilitates the video classification loss to improve models' localization ability. As such, our LPAT model is able to generate accurate action proposals with only video-level supervision. Extensive experiments on two standard yet challenging datasets, i.e., THUMOS'14 and ActivityNet1.2, show significant improvement over state-of-the-art methods.
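
The fixed-threshold strategy that the abstract describes as prevalent can be illustrated with a minimal sketch. The function name `segments_above`, the toy scores, and the two threshold values are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def segments_above(scores, threshold):
    """Return [start, end) index pairs of consecutive snippets whose score exceeds threshold."""
    mask = np.asarray(scores) > threshold
    # Pad with False so every run has a detectable rising and falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate: run start, run end (exclusive), run start, ...
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]

# Toy action scores for one class over 8 snippets (made-up numbers).
scores = [0.1, 0.6, 0.7, 0.2, 0.4, 0.8, 0.9, 0.3]
print(segments_above(scores, 0.5))   # [(1, 3), (5, 7)]
print(segments_above(scores, 0.35))  # [(1, 3), (4, 7)]
```

The two calls return different segments for the same scores, which shows how a hand-picked threshold changes the localization result.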

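Motivation and Objectives

Remove the need for manually set thresholds in weakly-supervised temporal action localization, which introduce bias and hinder fair comparison.
Improve the coherence of the localization model and its end-to-end training by learning to predict the threshold during training.
Increase the confidence of generated proposals through a novel constraint loss that better aligns the localization and video classification objectives.
Achieve state-of-the-art performance on standard benchmarks using only video-level annotations.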

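Proposed Method

LPAT learns the threshold at each snippet as that snippet's predicted background score, so the thresholding mechanism is trained end to end.
The model uses the predicted threshold to cut out high-scoring action segments at both training and inference time.
A novel constraint loss improves the confidence of generated proposals and strengthens the link between localization and video classification.
The method jointly optimizes action localization and video classification, sharing the same score sequence across both tasks.
The threshold prediction is differentiable, so backpropagation can optimize the threshold and the action score head together.
The framework relies only on video-level supervision and needs no temporal boundary annotations.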
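
The per-snippet thresholding described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function name `localize_adaptive`, the toy scores, and the choice to score each proposal by its mean action score are all hypothetical.

```python
import numpy as np

def localize_adaptive(action_scores, background_scores):
    """Keep runs of snippets whose action score exceeds that snippet's background score."""
    action = np.asarray(action_scores, dtype=float)
    background = np.asarray(background_scores, dtype=float)
    # Per-snippet threshold: each snippet is compared with its own background score.
    mask = action > background
    # Pad with False so every run has a detectable start and end.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    proposals = []
    for start, end in zip(edges[0::2], edges[1::2]):
        # Assumption: score a proposal by the mean action score of its snippets.
        proposals.append((int(start), int(end), float(action[start:end].mean())))
    return proposals

# Toy scores for one class over 6 snippets (made-up numbers).
action = [0.2, 0.6, 0.7, 0.3, 0.4, 0.8]
background = [0.5, 0.3, 0.2, 0.6, 0.3, 0.1]
for start, end, score in localize_adaptive(action, background):
    print(start, end, round(score, 2))
```

In this toy input, snippet 4 has an action score of 0.4, so a fixed threshold of 0.5 would drop it. It is kept here because its background score is only 0.3.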

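Experimental Results

Research Questions

RQ1: Does learning an adaptive threshold improve weakly-supervised temporal action localization without manual tuning?
RQ2: How does learning the threshold as a predicted background score affect proposal quality and model generalization?
RQ3: How does the novel constraint loss affect proposal confidence and classification-guided localization?
RQ4: How far can a unified model go when it jointly optimizes localization and classification with only video-level supervision?
RQ5: How does LPAT compare with state-of-the-art methods on standard benchmarks such as THUMOS'14 and ActivityNet1.2?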

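Key Findings

LPAT achieves state-of-the-art performance on THUMOS'14, with a mean average precision (mAP) for action localization significantly higher than that of prior methods.
On ActivityNet1.2, LPAT shows significant improvement over existing weakly-supervised methods, supporting its ability to generalize across diverse datasets.
The proposed constraint loss raises the confidence of generated action proposals, which leads to more reliable localization.
By learning the threshold end to end, LPAT removes the dependence on expert-defined thresholds and makes the model more coherent and comparisons fairer.
Jointly optimizing localization and classification over a shared score sequence significantly improves performance on both tasks, demonstrating the benefit of unified training.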

This review was generated by AI and reviewed by a human editor.