QUICK REVIEW

[Paper Review] TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization

Tuan N. Tang, Kwonyoung Kim|arXiv (Cornell University)|Mar 16, 2023

Human Pose and Action Recognition | Cited by 21

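One-Sentence Summary

TemporalMaxer uses a simple, parameter-free max-pooling block to maximize local temporal information from pre-extracted 3D-CNN features, outperforming long-term TCM methods on TAL with higher speed and fewer parameters.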

ABSTRACT

Temporal Action Localization (TAL) is a challenging task in video understanding that aims to identify and localize actions within a video sequence. Recent studies have emphasized the importance of applying long-term temporal context modeling (TCM) blocks to the extracted video clip features such as employing complex self-attention mechanisms. In this paper, we present the simplest method ever to address this task and argue that the extracted video clip features are already informative to achieve outstanding performance without sophisticated architectures. To this end, we introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features with a basic, parameter-free, and local region operating max-pooling block. Picking out only the most critical information for adjacent and local clip embeddings, this block results in a more efficient TAL model. We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term TCM such as self-attention on various TAL datasets while requiring significantly fewer parameters and computational resources. The code for our approach is publicly available at https://github.com/TuanTNG/TemporalMaxer

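Research Motivation and Goals

Challenge the need for heavy long-term temporal context modeling (TCM) and push toward a minimalist approach to temporal action localization (TAL).
Investigate whether pre-extracted 3D-CNN features, paired with a simple local max-pooling block, carry enough information for accurate TAL.
Develop TemporalMaxer as a parameter-free local-context module that replaces expensive attention-based TCM blocks.
Evaluate TemporalMaxer on standard TAL benchmarks, comparing accuracy and inference speed against Transformer-based and graph-based long-term TCM methods.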

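Proposed Method

Extract clip features with a pretrained 3D-CNN to form the sequence X.
Build the multi-scale temporal feature pyramid Z with two 1D convolutional projection layers followed by L-1 TemporalMaxer blocks (max pooling with stride 2).
Decode every pyramid level with a shared lightweight head consisting of a classification branch and a regression branch.
Train with a multi-task loss that combines a Focal classification loss and a DIoU regression loss, applied across all pyramid levels, with an indicator marking positive samples.
Fix the kernel size of the TCM block at 3; ablations compare max pooling against convolution, subsampling, average pooling, and Transformer blocks.
The goal is a simple, parameter-free backbone in which max pooling preserves discriminative local information while exploiting the receptive field of the deep network.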
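
The pooling step of the encoder described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the names `max_pool_1d` and `build_pyramid`, the padding of `kernel // 2`, and the toy input are hypothetical. Only the kernel size of 3, the stride of 2, and the L-1 pooling steps come from the text. The two 1D convolutional projections are omitted.

```python
from typing import List

# A sequence of T time steps, each a C-dimensional feature vector.
Sequence = List[List[float]]


def max_pool_1d(x: Sequence, kernel: int = 3, stride: int = 2) -> Sequence:
    """Channel-wise max pooling along the time axis (no learnable parameters).

    Assumption: the input is padded by kernel // 2 on each side, so that
    kernel=3 with stride=2 halves the sequence length (rounding up).
    Padded positions are simply skipped, which equals -inf padding.
    """
    pad = kernel // 2
    length = len(x)
    n_out = (length + 2 * pad - kernel) // stride + 1
    pooled = []
    for i in range(n_out):
        start = i * stride - pad
        lo, hi = max(start, 0), min(start + kernel, length)
        window = x[lo:hi]
        # zip(*window) groups the values of each channel across the window.
        pooled.append([max(channel) for channel in zip(*window)])
    return pooled


def build_pyramid(x: Sequence, num_levels: int) -> List[Sequence]:
    """Return num_levels feature maps; each new level pools the previous one."""
    pyramid = [x]
    for _ in range(num_levels - 1):  # L levels need L-1 pooling blocks
        pyramid.append(max_pool_1d(pyramid[-1]))
    return pyramid


# Toy input: 8 time steps, 2 channels (values chosen only for illustration).
features = [[float(t), float(-t)] for t in range(8)]
pyramid = build_pyramid(features, num_levels=3)
print([len(level) for level in pyramid])
```

With the toy input, each level has half the temporal length of the one before it, and no weights are introduced at any level, which is what "parameter-free" refers to here.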

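Experimental Results

Research Questions

RQ1: With high-quality pre-extracted features, is a parameter-free max-pooling TCM block sufficient to maximize temporal context for TAL?
RQ2: Compared with Transformer-based and graph-based long-term TCM methods, can TemporalMaxer achieve competitive or better TAL performance with significantly fewer parameters and lower computational cost?
RQ3: How does TemporalMaxer perform on standard TAL datasets (THUMOS14, EPIC-Kitchens 100, MultiTHUMOS, MUSES) relative to state-of-the-art baselines?
RQ4: How do different kernel sizes of the max-pooling TCM block affect TAL performance and efficiency?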

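Key Findings

THUMOS14: mAP (%) at tIoU thresholds 0.3 to 0.7, and inference time per video
Model	Features	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7	Avg.	Time (ms)
ActionFormer [60]	I3D [7]	82.1	77.8	71.0	59.4	43.9	66.8	80
Ours (TemporalMaxer)	I3D [7]	82.8	78.9	71.8	60.5	44.7	67.7	50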
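
The summary columns of the table above can be checked with simple arithmetic. The sketch below assumes that the average column is the plain mean of the five per-threshold values and that the time column is directly comparable between the two rows. The variable and function names are illustrative only.

```python
# Per-threshold mAP values copied from the table (tIoU 0.3 to 0.7).
THUMOS14_MAP = {
    "ActionFormer": [82.1, 77.8, 71.0, 59.4, 43.9],
    "TemporalMaxer": [82.8, 78.9, 71.8, 60.5, 44.7],
}

# Inference time per video in milliseconds, copied from the table.
TIME_MS = {"ActionFormer": 80, "TemporalMaxer": 50}


def average_map(values):
    """Arithmetic mean of the per-threshold mAP values."""
    return sum(values) / len(values)


for name, values in THUMOS14_MAP.items():
    print(name, round(average_map(values), 1))

# Ratio of inference times: how many times faster the second model runs.
speedup = TIME_MS["ActionFormer"] / TIME_MS["TemporalMaxer"]
print("speedup:", speedup)
```

Under these assumptions the means round to 66.8 and 67.7, matching the table, and the time ratio is 1.6, which is a 37.5% reduction in per-video inference time.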

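TemporalMaxer reaches 67.7 mAP averaged over tIoU thresholds on THUMOS14, surpassing prior methods, including those that use long-term TCM.
TemporalMaxer reduces backbone computation and speeds up inference: 50 ms per video on THUMOS14, versus 80 ms for the ActionFormer baseline.
On EPIC-Kitchens 100, TemporalMaxer reaches an average mAP of 24.5% for verbs and 22.8% for nouns, about 1.0 and 0.9 percentage points higher than the ActionFormer baseline, respectively.
On MUSES, TemporalMaxer reaches an average mAP of 27.2, surpassing previous long-term TCM methods.
On MultiTHUMOS, TemporalMaxer reaches an average mAP of 29.9%, well above the PointTAD and ActionFormer baselines.
Ablation studies show that max pooling as the TCM block outperforms convolution, subsampling, and average pooling, with performance peaking at a kernel size of 3 while remaining efficient.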
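
The findings above are reported as mAP at temporal IoU (tIoU) thresholds. As a reminder of what a threshold means, here is a minimal sketch of tIoU between a predicted and a ground-truth segment. The function names and the example segments are hypothetical, and a full mAP evaluation (ranking by confidence, one-to-one matching, per-class averaging) is deliberately left out.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end), e.g. in seconds

# Thresholds used in the THUMOS14 results.
THRESHOLDS = (0.3, 0.4, 0.5, 0.6, 0.7)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def matched_thresholds(pred: Segment, gt: Segment) -> List[float]:
    """Thresholds at which this prediction would count as a correct match."""
    iou = temporal_iou(pred, gt)
    return [t for t in THRESHOLDS if iou >= t]


# Hypothetical example: the prediction overlaps the ground truth by 4 of 8 seconds.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))
print(matched_thresholds((2.0, 8.0), (4.0, 10.0)))
```

A prediction that overlaps well but not tightly counts at the looser thresholds and fails at the stricter ones, which is why per-threshold mAP drops steadily from 0.3 to 0.7 for both models.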

This review was generated by AI and reviewed by a human editor.