QUICK REVIEW

[Paper Review] YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection

Jianhua Yang, Kun Dai | arXiv (Cornell University) | Feb 14, 2023

Human Pose and Action Recognition | Cited by 11

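One-Sentence Summary

YOWOv2 is a real-time, anchor-free, multi-level spatio-temporal action detector that fuses a 3D backbone with a multi-level 2D backbone through a decoupled fusion head, reaching 87.0% frame mAP on UCF101-24 and 21.7% frame mAP on AVA at over 20 FPS.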

ABSTRACT

Designing a real-time framework for the spatio-temporal action detection task is still a challenge. In this paper, we propose a novel real-time action detection framework, YOWOv2. In this new framework, YOWOv2 takes advantage of both the 3D backbone and 2D backbone for accurate action detection. A multi-level detection pipeline is designed to detect action instances of different scales. To achieve this goal, we carefully build a simple and efficient 2D backbone with a feature pyramid network to extract different levels of classification features and regression features. For the 3D backbone, we adopt the existing efficient 3D CNN to save development time. By combining 3D backbones and 2D backbones of different sizes, we design a YOWOv2 family including YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. We also introduce the popular dynamic label assignment strategy and anchor-free mechanism to make the YOWOv2 consistent with the advanced model architecture design. With our improvement, YOWOv2 is significantly superior to YOWO, and can still keep real-time detection. Without any bells and whistles, YOWOv2 achieves 87.0 % frame mAP and 52.8 % video mAP with over 20 FPS on the UCF101-24. On the AVA, YOWOv2 achieves 21.7 % frame mAP with over 20 FPS. Our code is available on https://github.com/yjh0410/YOWOv2.

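Motivation and Objectives

Advance real-time spatio-temporal action detection that can accurately detect small action instances.
Develop a multi-level, anchor-free detection framework to improve detection of small instances.
Efficiently fuse 3D spatio-temporal features with multi-level 2D spatial features.
Provide a family of models (Tiny, Medium, Large) to fit different compute budgets.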

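Proposed Method

Use a 3D backbone to extract spatio-temporal features from the video clip.
Use a multi-level 2D backbone with a feature pyramid network to produce decoupled classification and regression features at three levels.
Introduce a ChannelEncoder that fuses the 2D and 3D features, including a DANet-style self-attention step.
At each level, use a decoupled fusion head to fuse the spatio-temporal feature F_ST with the classification feature F_cls and with the regression feature F_reg separately.
Train anchor-free with dynamic label assignment (SimOTA), so no predefined anchor boxes are needed.
Train with a loss that combines conf, cls, and reg terms, balanced by a factor lambda.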
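The dynamic label assignment step can be illustrated with a minimal sketch of SimOTA-style matching in plain Python. It is not the authors' implementation. The function name `simota_assign`, the `topk=10` default, and the toy matrices are assumptions made for illustration. The cost matrix is taken as an input because this summary does not say how YOWOv2 builds it.

```python
def simota_assign(cost, iou, topk=10):
    """Match predictions to ground-truth boxes in a SimOTA-like way.

    cost[g][p] and iou[g][p] are G x P matrices (ground truths x predictions).
    Returns a list of length P holding the matched ground-truth index, or -1
    for predictions treated as background.
    """
    num_gt = len(cost)
    num_pred = len(cost[0]) if num_gt else 0
    # candidates[p] collects every ground truth that selected prediction p
    candidates = [[] for _ in range(num_pred)]
    for g in range(num_gt):
        # dynamic k: integer part of the sum of the top-k IoUs, at least 1
        top_ious = sorted(iou[g], reverse=True)[:topk]
        dynamic_k = max(1, int(sum(top_ious)))
        # keep the dynamic_k predictions with the lowest cost for this ground truth
        ranked = sorted(range(num_pred), key=lambda p: cost[g][p])
        for p in ranked[:dynamic_k]:
            candidates[p].append(g)
    # a prediction claimed by several ground truths keeps the cheapest one
    assignment = []
    for p, gts in enumerate(candidates):
        assignment.append(min(gts, key=lambda g: cost[g][p]) if gts else -1)
    return assignment


# Hypothetical toy data: 2 ground truths, 4 predictions.
toy_iou = [[0.9, 0.8, 0.5, 0.0],
           [0.0, 0.6, 0.7, 0.8]]
toy_cost = [[0.1, 0.3, 2.0, 5.0],
            [5.0, 0.2, 0.4, 0.3]]
print(simota_assign(toy_cost, toy_iou))  # prints [0, 1, -1, 1]
```

In the toy data each ground truth gets a dynamic k of 2. Prediction 1 is selected by both ground truths and goes to the second one because its cost there is lower. Prediction 2 is selected by neither and is left as background.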

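Experimental Results

Research Questions

RQ1: Can a multi-level, anchor-free detector improve localization of small actions while still running in real time?
RQ2: Does decoupled fusion of 2D and 3D features outperform coupled fusion in spatio-temporal action detection (STAD)?
RQ3: How do backbones of different sizes (Tiny/Medium/Large) affect speed and accuracy on datasets such as UCF101-24 and AVA?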

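Key Findings

On UCF101-24, YOWOv2-Tiny/Medium/Large achieve higher frame mAP and video mAP than YOWO with fewer FLOPs and parameters.
The decoupled fusion head outperforms the coupled fusion head, improving F-mAP and V-mAP at a small cost in speed.
Dynamic label assignment (SimOTA) enables anchor-free training with competitive performance.
On UCF101-24 with 16-frame clips, YOWOv2-L reaches 85.2% F-mAP and 52.0% V-mAP at 30 FPS on an RTX 3090; with 32-frame clips it improves to 87.0% F-mAP and 52.8% V-mAP at 22 FPS.
On AVA, YOWOv2-L reaches 21.7% frame mAP at over 20 FPS (K=16).
YOWOv2-T surpasses YOWO on both F-mAP and V-mAP while using far fewer FLOPs and parameters.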
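To make the speed-accuracy trade-off concrete, the short sketch below converts the reported FPS figures for YOWOv2-L on UCF101-24 into an average time per frame and computes the gains from moving to 32-frame clips. It uses only the numbers listed above. The helper name `frame_budget_ms` is an assumption, and the per-frame times are simple averages derived from FPS, not measured latencies.

```python
def frame_budget_ms(fps):
    """Average time per frame, in milliseconds, at a given FPS."""
    return 1000.0 / fps


# Figures reported above for YOWOv2-L on UCF101-24 (RTX 3090), keyed by clip length.
reported = {
    16: {"f_map": 85.2, "v_map": 52.0, "fps": 30},
    32: {"f_map": 87.0, "v_map": 52.8, "fps": 22},
}

for clip_len, r in reported.items():
    print(f"{clip_len} frames: {frame_budget_ms(r['fps']):.1f} ms per frame")

# Accuracy gained, and time added, by going from 16-frame to 32-frame clips.
f_gain = round(reported[32]["f_map"] - reported[16]["f_map"], 1)
v_gain = round(reported[32]["v_map"] - reported[16]["v_map"], 1)
extra_ms = frame_budget_ms(reported[32]["fps"]) - frame_budget_ms(reported[16]["fps"])
print(f"32 vs 16 frames: +{f_gain} F-mAP, +{v_gain} V-mAP, +{extra_ms:.1f} ms per frame")
```

With these figures, the longer clip adds 1.8 points of F-mAP and 0.8 points of V-mAP for roughly 12 ms more per frame on average.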


This review was generated by AI and reviewed by a human editor.