QUICK REVIEW

[Paper Review] Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition

Heeseung Kwon, Manjin Kim | arXiv (Cornell University) | Feb 14, 2021

Human Pose and Action Recognition | References: 64 | Cited by: 4

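One-Sentence Summary

This paper proposes SELFY, a neural block that models motion in video by learning spatio-temporal self-similarity (STSS): each local region is represented by its similarities to its neighbors in space and time. By leveraging the full STSS volume and training end-to-end without additional supervision, the method achieves state-of-the-art action recognition results on Something-Something-V1/V2, Diving-48, and FineGym, and effectively captures long-term interactions and fast motion.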

ABSTRACT

Spatio-temporal convolution often fails to learn motion dynamics in videos and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks, Something-Something-V1 & V2, Diving-48, and FineGym, the proposed method achieves the state-of-the-art results.

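Motivation and Objectives

Address the limitations of spatio-temporal convolution in modeling motion dynamics in video.
Develop a motion representation that captures structural patterns in space and time, going beyond raw appearance features.
Learn motion representations end-to-end through self-similarity, without additional supervision.
Improve the robustness of action recognition by modeling long-term interactions and fast motion through relational features.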

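Proposed Method

The method builds a spatio-temporal self-similarity (STSS) representation by computing similarities between each local region and its neighbors in a spatio-temporal neighborhood.
Similarity computation converts appearance features into relational values, which lets the model learn structural patterns in the video volume.
A neural block named SELFY processes the whole STSS volume and extracts an effective motion representation from it.
SELFY is differentiable and can be inserted into existing neural architectures, where it is trained end-to-end without auxiliary supervision.
A sufficiently large neighborhood in space and time lets the block model long-range dependencies and fast motion.
The STSS representation is learned jointly with the backbone network, so the model can focus on motion-related patterns.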
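The similarity step described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `stss`, the parameters `l_radius` and `s_radius`, the `(T, H, W, C)` feature layout, the choice of cosine similarity, and the zero-padding at the borders are all assumptions made for the example. The learned parts of SELFY that process the resulting volume are not shown.

```python
import numpy as np

def stss(feats, l_radius=1, s_radius=2, eps=1e-8):
    """Toy spatio-temporal self-similarity volume (hypothetical helper).

    feats: array of shape (T, H, W, C), one C-dim feature per position.
    Returns S of shape (T, H, W, L, U, V) with L = 2*l_radius + 1 and
    U = V = 2*s_radius + 1, where S[t, y, x, i, j, k] is the cosine
    similarity between feats[t, y, x] and the neighbor at temporal offset
    i - l_radius and spatial offset (j - s_radius, k - s_radius).
    Neighbors outside the clip get similarity 0 (assumed zero-padding).
    """
    T, H, W, _ = feats.shape
    # L2-normalize so that a dot product equals cosine similarity.
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
    # Zero-pad time and space so every offset can be read with a slice.
    padded = np.pad(
        f,
        ((l_radius, l_radius), (s_radius, s_radius), (s_radius, s_radius), (0, 0)),
    )
    L = 2 * l_radius + 1
    U = 2 * s_radius + 1
    out = np.zeros((T, H, W, L, U, U), dtype=f.dtype)
    for i in range(L):
        for j in range(U):
            for k in range(U):
                # Features shifted by the (i, j, k) offset, aligned with f.
                shifted = padded[i:i + T, j:j + H, k:k + W]
                out[:, :, :, i, j, k] = (f * shifted).sum(axis=-1)
    return out

# Usage on random features: 4 frames, 6x6 positions, 8 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6, 6, 8))
S = stss(feats, l_radius=1, s_radius=2)
print(S.shape)  # (4, 6, 6, 3, 5, 5)
```

The output no longer contains appearance features, only relational values: each position is described by how similar it is to its surroundings across frames. The explicit loops keep the indexing readable; a practical version would likely vectorize or use a correlation layer.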

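Experimental Results

Research Questions

RQ1: Can spatio-temporal self-similarity serve as a general and robust motion representation for video action recognition?
RQ2: How effectively does an STSS-based representation capture long-term interactions and fast motion in video?
RQ3: How complementary is STSS to conventional spatio-temporal convolution features for action recognition?
RQ4: Can an STSS-based neural block be inserted into existing architectures and trained end-to-end without additional supervision?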

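Key Findings

The proposed method achieves state-of-the-art results on the Something-Something-V1 and V2 benchmarks, which indicates strong motion modeling.
It sets a new state of the art on Diving-48, which highlights its effectiveness on complex actions.
It also achieves state-of-the-art results on FineGym, which supports its robustness across diverse action categories.
The STSS representation captures long-term interactions and fast motion and improves recognition accuracy over baseline methods.
The method is complementary to spatio-temporal features from direct convolution, which suggests that it learns distinct and useful motion patterns.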
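The link between neighborhood size and fast motion can be made concrete with a toy example. The sketch below is a synthetic illustration under stated assumptions, not an experiment from the paper: the helper `best_match`, the two hand-built frames, and the 3-pixel displacement are all hypothetical. It looks for the most similar position in the next frame within a square window and shows that a window smaller than the displacement cannot see the moved object.

```python
import numpy as np

def best_match(frame_a, frame_b, y, x, radius):
    """Offset in frame_b, within `radius`, most similar to frame_a[y, x].

    Hypothetical helper. Assumes unit-length feature vectors (so the dot
    product is cosine similarity) and a window that stays inside the frame.
    Returns ((dy, dx), similarity).
    """
    query = frame_a[y, x]
    window = frame_b[y - radius:y + radius + 1, x - radius:x + radius + 1]
    sims = window @ query  # shape (2*radius + 1, 2*radius + 1)
    idx = np.unravel_index(np.argmax(sims), sims.shape)
    return (int(idx[0]) - radius, int(idx[1]) - radius), float(sims[idx])

H = W = 11
background = np.array([1.0, 0.0, 0.0, 0.0])
obj = np.array([0.0, 1.0, 0.0, 0.0])  # orthogonal to the background

# Two frames filled with background; one "object" moves 3 pixels right.
frame_a = np.tile(background, (H, W, 1))
frame_b = np.tile(background, (H, W, 1))
frame_a[5, 4] = obj
frame_b[5, 7] = obj

small = best_match(frame_a, frame_b, 5, 4, radius=2)
large = best_match(frame_a, frame_b, 5, 4, radius=4)
print(small[1])  # 0.0: the moved object lies outside the small window
print(large)     # ((0, 3), 1.0): the peak sits at the true displacement
```

In this toy setting the displacement only becomes visible as a similarity peak once the spatial window covers it, which is the intuition behind using a sufficiently large neighborhood for fast motion.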

This review was generated by AI and reviewed by a human editor.