QUICK REVIEW

[Paper Digest] TriDet: Temporal Action Detection with Relative Boundary Modeling

Dingfeng Shi, Yujie Zhong|arXiv (Cornell University)|Mar 13, 2023

Human Pose and Action Recognition | Cited by 9

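One-Sentence Summary

TriDet improves temporal action detection by replacing self-attention with SGP-based layers in a multi-scale feature pyramid and by adding a Trident-head that models action boundaries through relative probability distributions, with the focus on precise boundary localization.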

ABSTRACT

In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank loss problem of self-attention that takes place in the video features and aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS and EPIC-KITCHEN 100, with lower computational costs, compared to previous methods. For example, TriDet hits an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$, but with only $74.6\%$ of its latency. The code is released to https://github.com/sssste/TriDet.

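Motivation and Goals

Address the high similarity among temporal features produced by video backbones and the rank loss this causes in self-attention.
Propose the SGP (Scalable-Granularity Perception) layer, which replaces self-attention with CNN-style operations.
Introduce the Trident-head, which models relative boundary probabilities to improve boundary localization.
Validate effectiveness on the THUMOS14 and HACS datasets and analyze computational efficiency.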
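The feature-similarity problem named in the first goal can be made concrete with a small measurement. The sketch below computes the mean cosine similarity between each instant's feature vector and the temporal average feature. The function names are hypothetical, and this is only an illustrative proxy for the similarity the digest describes, not necessarily the exact diagnostic used in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity_to_average(features):
    """features: list of T instants, each a list of C floats.

    Returns the average cosine similarity between every instant and the
    temporal mean feature. Values close to 1.0 mean the instants are hard
    to tell apart, which is the situation the digest describes.
    """
    num_instants = len(features)
    num_channels = len(features[0])
    average = [
        sum(frame[c] for frame in features) / num_instants
        for c in range(num_channels)
    ]
    return sum(cosine(frame, average) for frame in features) / num_instants
```

A value near 1.0 on backbone features would indicate that individual instants carry little distinguishing information, which is the condition the SGP layer is meant to counteract.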

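Proposed Method

Replace self-attention with the SGP layer, which relaxes the weight constraints and emulates the effect of self-attention through multi-scale depthwise convolutions.
Add an instant-level branch to increase the discriminability between action and non-action instants.
Add a window-level branch (the ψ component) to capture broader semantic context and stabilize scale selection.
Propose the Trident-head, which learns relative boundary probabilities so that it focuses on the boundary while still taking features inside the action into account.
Compare against SA-based transformers and dynamic filters, and report computational cost (latency) and mAP on standard benchmarks.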
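The two-branch design listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the learned linear layers are replaced by identity mappings, the kernels are passed in by hand, and the names `depthwise_conv1d` and `sgp_like_layer` are hypothetical. The structure it shows is an instant-level term gated by the temporal average, a window-level term in which a gate ψ scales the sum of two depthwise convolutions at different scales, and a residual connection.

```python
def depthwise_conv1d(x, kernels):
    """Zero-padded, same-length depthwise convolution along time.

    x: list of T instants, each a list of C floats.
    kernels: list of C odd-length kernels, one per channel.
    """
    num_instants, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_instants)]
    for c in range(num_channels):
        kernel = kernels[c]
        half = len(kernel) // 2
        for t in range(num_instants):
            total = 0.0
            for j, weight in enumerate(kernel):
                idx = t + j - half
                if 0 <= idx < num_instants:
                    total += weight * x[idx][c]
            out[t][c] = total
    return out


def relu(value):
    return value if value > 0.0 else 0.0


def sgp_like_layer(x, small_kernels, large_kernels, gate_kernels):
    """Simplified SGP-style layer (assumption: identity in place of FC layers)."""
    num_instants, num_channels = len(x), len(x[0])
    # Instant-level branch: gate each instant by the ReLU of the temporal mean.
    mean = [sum(frame[c] for frame in x) / num_instants for c in range(num_channels)]
    phi = [relu(m) for m in mean]
    # Window-level branch: psi gates the sum of two convolution scales.
    conv_small = depthwise_conv1d(x, small_kernels)
    conv_large = depthwise_conv1d(x, large_kernels)
    psi = depthwise_conv1d(x, gate_kernels)
    out = []
    for t in range(num_instants):
        row = []
        for c in range(num_channels):
            instant = phi[c] * x[t][c]
            window = psi[t][c] * (conv_small[t][c] + conv_large[t][c])
            row.append(instant + window + x[t][c])  # residual connection
        out.append(row)
    return out
```

In a real model the kernels and the omitted linear layers would be learned, and a framework's grouped 1D convolution would replace the explicit loops.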

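Experimental Results

Research Questions

RQ1: How can the rank loss caused by self-attention be mitigated in video-based temporal action detection?
RQ2: Does replacing self-attention with the CNN-style SGP layer improve feature discriminability and boundary localization?
RQ3: Does a boundary-aware head (the Trident-head) yield more precise action boundaries by exploiting relative boundary probabilities?
RQ4: What is the accuracy-latency trade-off of using SGP and the Trident-head on the THUMOS14 and HACS datasets?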

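Key Findings

The SGP layer increases instant-level discriminability and improves detection performance.
The Trident-head learns to emphasize action boundaries while still considering features inside the action, which yields more accurate boundary probabilities.
On HACS, the ablation study reports average mAP values of 36.3, 38.0, and 38.6 across the compared settings, a trend consistent with the THUMOS14 results.
The transformer's macro-architecture remains effective without self-attention, which supports the proposed SGP approach.
Although TriDet adds computation relative to a plain CNN, its fully convolutional structure still runs more efficiently on GPUs than self-attention, and the paper reports latency gains.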
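The idea of relative boundary modeling mentioned in the findings can be illustrated as an expectation over bins. The sketch below, for the start boundary only, adds a boundary response over the `num_bins + 1` instants up to and including instant `t` to a per-bin center-offset term, applies a softmax, and returns the expected distance from `t` to the start. The function names, the indexing direction, and the way the two terms are combined are assumptions made for illustration; the paper's head may differ in detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_start_offset(start_response, center_offset, t, num_bins):
    """Expected distance (in instants) from instant t back to the action start.

    start_response: per-instant boundary scores for the whole sequence.
    center_offset: num_bins + 1 scores predicted at instant t, one per bin.
    """
    if t < num_bins:
        raise ValueError("instant t needs num_bins instants to its left")
    if len(center_offset) != num_bins + 1:
        raise ValueError("center_offset must have num_bins + 1 entries")
    # Instants t - num_bins .. t, ordered left to right.
    window = start_response[t - num_bins : t + 1]
    logits = [s + c for s, c in zip(window, center_offset)]
    probs = softmax(logits)
    # Window index i lies (num_bins - i) instants to the left of t.
    return sum((num_bins - i) * p for i, p in enumerate(probs))
```

Because the offset is an expectation over a distribution rather than a single regressed number, a sharp response at one instant pulls the estimate toward that instant, while an ambiguous boundary spreads the probability mass over neighboring bins.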


This digest was generated by AI and reviewed by a human editor.