QUICK REVIEW

[Paper Review] You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

Okan Köpüklü, Xiangyu Wei|arXiv (Cornell University)|Nov 15, 2019

Human Pose and Action Recognition|References: 45|Cited by: 107

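One-Sentence Summary

YOWO is a real-time, single-stage architecture with a 2D key-frame branch and a 3D clip branch, fused by CFAM, that localizes actions in space and time and reaches state-of-the-art frame-level mAP at real-time speed.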

ABSTRACT

Spatiotemporal action localization requires the incorporation of two sources of information into the designed architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract these information with separate networks and use an extra mechanism for fusion to get detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. The YOWO architecture is fast providing 34 frames-per-second on 16-frames input clips and 62 frames-per-second on 8-frames input clips, which is currently the fastest state-of-the-art architecture on spatiotemporal action localization task. Remarkably, YOWO outperforms the previous state-of-the art results on J-HMDB-21 and UCF101-24 with an impressive improvement of ~3% and ~12%, respectively. Moreover, YOWO is the first and only single-stage architecture that provides competitive results on AVA dataset. We make our code and pretrained models publicly available.

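Research Motivation and Objectives

Enable real-time spatiotemporal action localization without separate region-proposal and fusion stages.
Propose a unified, end-to-end architecture that combines 2D spatial features with 3D temporal features.
Demonstrate real-time performance and competitive accuracy on standard benchmarks.
Investigate effective feature fusion across branches through channel attention.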

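Proposed Method

Introduces YOWO, which has two parallel branches: a 2D-CNN applied to the key frame and a 3D-CNN applied to a short video clip.
Fuses the features of the two branches with the Channel Fusion and Attention Mechanism (CFAM), which is based on Gram-matrix correlations between channels.
Uses a YOLO-like head with 5 anchor boxes per grid cell for single-stage bounding-box regression.
Trains end-to-end with a composite loss: smooth L1 for localization, mean squared error for confidence, and focal loss (the α-balanced variant) for classification.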
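Optionally adds a long-term feature bank (LFB) at inference to widen the temporal context. Because the averaged 3D features are centered on the key frame, this variant uses future frames and is no longer causal.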
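Applies a linking algorithm to connect per-frame detections into action tubes, and evaluates both frame-level and video-level performance.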
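
The Gram-matrix channel attention at the core of CFAM can be illustrated with a minimal pure-Python sketch. It covers only the attention step on already-concatenated features and leaves out the convolution layers that surround it in the full module. The function name, the alpha blend weight, and the toy inputs are assumptions made for illustration, not the authors' code; a real implementation would use a tensor library.

```python
import math


def gram_channel_attention(channels, alpha=1.0):
    """Re-weight channels by their Gram-matrix similarity.

    channels: list of C channel vectors, each a flattened H*W feature map.
    alpha: blend weight for the attended features (assumed for this sketch).
    """
    # Gram matrix: G[i][j] is the dot product of channel i and channel j.
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in channels]
            for ci in channels]

    # A row-wise softmax turns each row of G into attention weights.
    attention = []
    for row in gram:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        attention.append([e / total for e in exps])

    # Each output channel is an attention-weighted mix of all channels,
    # scaled by alpha and added back to the input (residual connection).
    n = len(channels[0])
    out = []
    for i, weights in enumerate(attention):
        mixed = [sum(w * ch[k] for w, ch in zip(weights, channels))
                 for k in range(n)]
        out.append([alpha * m + x for m, x in zip(mixed, channels[i])])
    return out


# Toy input: two channels standing in for the 2D branch and one for the
# 3D branch, concatenated along the channel axis (maps flattened to 4 cells).
feat_2d = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
feat_3d = [[0.5, 0.5, 0.5, 0.5]]
fused = feat_2d + feat_3d
attended = gram_channel_attention(fused, alpha=0.5)
```

With alpha set to 0 the function returns its input unchanged, so the attention term acts as a correction added on top of the plain concatenation of the two branches.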

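Experimental Results

Research Questions

RQ1: Can a single-stage architecture effectively fuse 2D spatial features and 3D temporal features for spatiotemporal action localization?
RQ2: Does the Gram-matrix-based channel attention module improve cross-branch feature fusion and detection accuracy?
RQ3: How do clip length, downsampling rate, and backbone complexity trade off accuracy against speed?
RQ4: How does YOWO compare with prior methods on UCF101-24, J-HMDB-21, and AVA, particularly in the online/causal setting?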

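Key Findings

YOWO runs at 34 frames per second on 16-frame clips and 62 frames per second on 8-frame clips, making it the fastest state-of-the-art method for spatiotemporal action localization at the time of publication.
On UCF101-24, YOWO with 2D+3D+CFAM reaches 79.2% frame-level mAP at IoU=0.5 (versus 61.6% for 2D alone, 70.5% for 3D alone, and 73.8% for 2D+3D).
On J-HMDB-21, YOWO with 2D+3D+CFAM reaches 64.9% frame-level mAP (2D: 36.0%, 3D: 41.5%, 2D+3D: 47.1%).
On AVA, YOWO with 2D+3D+CFAM reaches 16.4% frame-level mAP (2D: 13.2%, 3D: 13.7%, 2D+3D: 16.0%).
Ablation experiments show that the 3D-CNN contributes stronger classification, the 2D-CNN contributes localization ability, and CFAM improves both.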
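
The frame-level mAP figures above depend on an IoU threshold of 0.5. As a rough illustration of how a single detection is typically scored under such a threshold, the sketch below computes the intersection-over-union of two boxes and checks it against 0.5. The function names, the (x1, y1, x2, y2) box format, and the example boxes are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_label, threshold=0.5):
    """A detection counts as correct when the action label matches and
    the box overlap reaches the IoU threshold."""
    return pred_label == gt_label and box_iou(pred_box, gt_box) >= threshold


# Hypothetical ground-truth box and two candidate detections.
gt = (0, 0, 10, 10)
half_overlap = (0, 0, 10, 5)      # covers the top half of the ground truth
shifted = (5, 5, 15, 15)          # overlaps only one corner

print(box_iou(gt, half_overlap))  # 0.5
print(is_true_positive(half_overlap, "run", gt, "run"))   # True
print(is_true_positive(shifted, "run", gt, "run"))        # False
```

Average precision is then computed per class from detections ranked by confidence, and mAP is the mean over classes; that aggregation step is omitted here.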


This review was generated by AI and checked by human editors.