QUICK REVIEW

[Paper Review] Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition

Helei Qiu, Biao Hou|arXiv (Cornell University)|Jan 8, 2022

Human Pose and Action Recognition|Cited by 38

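One-Sentence Summary

This paper proposes STTFormer, a Transformer-based model that encodes short spatio-temporal tuples of joints to capture cross-joint correlations between consecutive frames, and adds an inter-frame feature aggregation module to distinguish similar actions. It reports state-of-the-art results on the NTU RGB+D and NTU RGB+D 120 datasets.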

ABSTRACT

Capturing the dependencies between joints is critical in the skeleton-based action recognition task. Transformers show great potential for modeling the correlation of important joints. However, existing Transformer-based methods cannot capture the correlation of different joints between frames, which is very useful since different body parts (such as the arms and legs in "long jump") move together across adjacent frames. To address this problem, a novel spatio-temporal tuples Transformer (STTFormer) method is proposed. The skeleton sequence is divided into several parts, and the consecutive frames contained in each part are encoded. A spatio-temporal tuples self-attention module is then proposed to capture the relationship of different joints in consecutive frames. In addition, a feature aggregation module is introduced between non-adjacent frames to enhance the ability to distinguish similar actions. Compared with the state-of-the-art methods, our method achieves better performance on two large-scale datasets.

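Research Motivation and Goals

Address a limitation of existing Transformer-based methods for skeleton data: they fail to model correlations between different joints of the same skeleton across consecutive frames.
Propose a spatio-temporal tuple encoding strategy that flattens and encodes consecutive frames.
Develop STTFormer, which combines spatio-temporal tuples self-attention (STTA) with inter-frame feature aggregation (IFFA).
Incorporate positional encoding and multi-modal data to improve recognition accuracy.
Evaluate on the large-scale skeleton action datasets NTU RGB+D and NTU RGB+D 120, and validate each component through ablation studies.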

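Proposed Method

Spatio-temporal tuple encoding: the skeleton sequence is divided into non-overlapping parts (tuples); the consecutive frames in each tuple are flattened and encoded by a convolutional layer.
Spatio-temporal tuples Transformer (STTFormer): the STTA module models relationships among joints within a tuple, using multi-head self-attention combined with 1×1 convolutions followed by a 1×k1 convolution.
Inter-frame feature aggregation (IFFA): a k2×1 temporal convolution integrates sub-actions across tuples.
Positional encoding: sinusoidal encoding distinguishes the joints and frames within a tuple.
Multi-modal data fusion: the joint, bone, and joint-motion modalities are fused for the final prediction.
Training is end-to-end with SGD and cross-entropy loss, using standard data padding to 120 frames.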
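The tuple partitioning and sinusoidal encoding steps above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the helper names (`pad_to_length`, `split_into_tuples`, `sinusoidal_encoding`), the zero-padding strategy, the input length of 100 frames, and the encoding width of 8 are all assumptions made for the example. The 25 joints per frame match the NTU RGB+D skeleton, and n = 6 frames per tuple and the 120-frame length follow the summary. The learned convolutional encoder is omitted.

```python
import math


def pad_to_length(frames, target_len):
    """Truncate or zero-pad a list of frames to exactly target_len frames.

    Zero-padding is an assumption; the summary only says sequences are padded.
    """
    if len(frames) >= target_len:
        return frames[:target_len]
    num_joints = len(frames[0])
    num_coords = len(frames[0][0])
    missing = target_len - len(frames)
    padding = [[[0.0] * num_coords for _ in range(num_joints)]
               for _ in range(missing)]
    return frames + padding


def split_into_tuples(frames, n):
    """Split frames into non-overlapping tuples of n consecutive frames.

    Each tuple is flattened into n * V joint tokens, so attention inside a
    tuple can relate any joint in one frame to any joint in a nearby frame.
    Trailing frames that do not fill a whole tuple are dropped.
    """
    tuples = []
    for start in range(0, len(frames) - n + 1, n):
        chunk = frames[start:start + n]
        tuples.append([joint for frame in chunk for joint in frame])
    return tuples


def sinusoidal_encoding(num_positions, dim):
    """Standard sinusoidal encoding: sin on even channels, cos on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical input: 100 frames, 25 joints, 3 coordinates per joint.
T, V, C, n = 100, 25, 3, 6
frames = [[[0.0] * C for _ in range(V)] for _ in range(T)]

padded = pad_to_length(frames, 120)   # 120 frames
tuples = split_into_tuples(padded, n)  # 20 tuples of 150 joint tokens each
pe = sinusoidal_encoding(n * V, 8)     # one encoding row per token in a tuple

print(len(padded), len(tuples), len(tuples[0]), len(pe), len(pe[0]))
```

With these settings the script prints `120 20 150 150 8`. Attention within a tuple then runs over 150 tokens instead of all 120 × 25 = 3000 joint positions in the sequence, which is the cost trade-off raised in RQ2.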

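Experimental Results

Research Questions

RQ1: Does modeling correlations between different joints across consecutive frames improve skeleton-based action recognition performance?
RQ2: Can spatio-temporal tuple encoding capture cross-frame joint relationships while reducing computational cost?
RQ3: Does inter-frame aggregation help distinguish similar actions by aggregating sub-actions?
RQ4: How do positional encoding and multi-modal data affect STTFormer's performance?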

Key Findings

Method	NTU RGB+D X-Sub (%)	NTU RGB+D X-View (%)	NTU RGB+D 120 X-Sub (%)	NTU RGB+D 120 X-Set (%)
MTCNN	81.1	87.4	61.2	63.3
IndRNN	81.8	88.0	-	-
HCN	86.5	91.1	-	-
ST-GCN	81.5	88.3	-	-
2s-AGCN	88.5	95.1	82.9	84.9
DGNN	89.9	96.1	-	-
Shift-GCN	90.7	96.5	85.9	87.6
Dynamic-GCN	91.5	96.0	85.9	87.6
MS-G3D	91.5	96.2	86.9	88.4
MST-GCN	91.5	96.6	87.5	88.8
ST-TR	89.9	96.1	-	-
DSTA-Net	91.5	96.4	86.6	89.0
STTFormer(Ours)	92.3	96.5	88.3	89.2

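STTFormer achieves state-of-the-art results on the NTU RGB+D and NTU RGB+D 120 skeleton benchmarks (NTU RGB+D: X-Sub 92.3%, X-View 96.5%; NTU RGB+D 120: X-Sub 88.3%, X-Set 89.2%). In the table above it is the best on three of the four settings; on X-View it is 0.1 points below MST-GCN (96.6%).
Ablations show that removing positional encoding (PE) lowers accuracy (without PE: X-Sub 89.3%, X-View 91.8%; with PE: X-Sub 89.9%, X-View 94.3%).
Removing inter-frame feature aggregation (IFFA) markedly lowers performance (without IFFA: X-Sub 84.5%, X-View 88.1%).
Using n=6 frames per tuple gives the best results (n=1: X-Sub 82.9%, X-View 86.0%; n=6: X-Sub 86.2%, X-View 91.3%).
Fusing the joint, bone, and joint-motion modalities is more accurate than any single modality (fused: X-Sub 92.3%, X-View 96.5% on NTU RGB+D; X-Sub 88.3%, X-Set 89.2% on NTU RGB+D 120).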
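The multi-modal fusion finding can be illustrated with a small score-level ensemble. The summary does not say how the three streams are derived or combined, so everything here is an assumption drawn from common practice in skeleton-based recognition: bones as joint-minus-parent vectors, motion as frame-to-frame differences, and fusion as a weighted average of per-stream softmax scores. The function names, the two-joint toy skeleton, and the logits are hypothetical.

```python
import math


def bone_stream(frame, parents):
    """Bone vector per joint: the joint minus its parent joint (assumed definition)."""
    return [[a - b for a, b in zip(frame[j], frame[parents[j]])]
            for j in range(len(frame))]


def motion_stream(frames):
    """Joint motion: difference between consecutive frames; the last frame gets zeros."""
    motion = []
    for t in range(len(frames) - 1):
        motion.append([[b - a for a, b in zip(j0, j1)]
                       for j0, j1 in zip(frames[t], frames[t + 1])])
    motion.append([[0.0] * len(joint) for joint in frames[-1]])
    return motion


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_logits, weights=None):
    """Weighted average of per-stream softmax scores (assumed fusion rule)."""
    if weights is None:
        weights = [1.0] * len(stream_logits)
    fused = [0.0] * len(stream_logits[0])
    for logits, w in zip(stream_logits, weights):
        for k, p in enumerate(softmax(logits)):
            fused[k] += w * p
    total_w = sum(weights)
    return [s / total_w for s in fused]


# Toy skeleton: 2 frames, 2 joints, joint 1 attached to joint 0 (the root).
frames = [
    [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
    [[0.5, 0.0, 0.0], [1.5, 2.0, 4.0]],
]
parents = [0, 0]
bones = bone_stream(frames[0], parents)
motion = motion_stream(frames)

# Hypothetical logits from three separately trained streams, 3 classes.
joint_logits = [2.0, 1.0, 0.1]
bone_logits = [0.5, 2.5, 0.3]
motion_logits = [0.2, 2.0, 1.5]

fused = fuse_scores([joint_logits, bone_logits, motion_logits])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted)
```

In this toy case the joint stream alone favors class 0, while the fused scores favor class 1. That is the kind of correction an ensemble of complementary modalities can provide.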

This review was generated by AI and reviewed by a human editor.