QUICK REVIEW

[Paper Review] Skeleton-based Action Recognition via Temporal-Channel Aggregation

Shengqin Wang, Yongji Zhang|arXiv (Cornell University)|May 31, 2022

Human Pose and Action Recognition|Cited by 25

One-Sentence Summary

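The paper proposes TCA-GCN (Temporal-Channel Aggregation Graph Convolutional Network). The model learns spatial and temporal topologies dynamically and fuses multi-scale temporal features with an attention mechanism. It reports state-of-the-art results on the NTU RGB+D, NTU RGB+D 120 and NW-UCLA datasets.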

ABSTRACT

Skeleton-based action recognition methods are limited by the semantic extraction of spatio-temporal skeletal maps. However, current methods have difficulty effectively combining features from both the temporal and spatial graph dimensions, and they tend to favor one dimension at the expense of the other. In this paper, we propose a Temporal-Channel Aggregation Graph Convolutional Network (TCA-GCN) for skeleton-based action recognition. TCA-GCN learns spatial and temporal topologies dynamically and efficiently aggregates topological features across different temporal and channel dimensions. We use the Temporal Aggregation module to learn temporal dimensional features and the Channel Aggregation module to efficiently combine spatial dynamic channel-wise topological features with temporal dynamic topological features. In addition, we extract multi-scale skeletal features during temporal modeling and fuse them with an attention mechanism. Extensive experiments show that our model outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
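The abstract says that multi-scale temporal features are extracted and then fused with an attention mechanism. This review does not spell out the design of the TF module (MSCONE and M attention), so the following is only a stand-in sketch, and every detail in it is an assumption:

- The input is one channel of one joint over T frames.
- Three dilated temporal convolutions (dilations 1, 2 and 3) share a fixed averaging kernel.
- A parameter-free, selective-kernel-style attention softmaxes each branch's time-averaged response.

All function names are hypothetical.

```python
import math


def dilated_temporal_conv(seq, kernel, dilation):
    """1-D convolution over time with zero padding, so the output keeps len(seq)."""
    center = (len(kernel) - 1) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for i, w in enumerate(kernel):
            src = t + (i - center) * dilation  # dilated tap position
            if 0 <= src < len(seq):  # taps outside the sequence read zeros
                acc += w * seq[src]
        out.append(acc)
    return out


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiscale_attention_fusion(seq, dilations=(1, 2, 3), kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed TF-style fusion: dilated branches weighted by a softmax over branches."""
    branches = [dilated_temporal_conv(seq, kernel, d) for d in dilations]
    # Squeeze: average each branch over time to get one descriptor per scale.
    descriptors = [sum(b) / len(b) for b in branches]
    # Parameter-free attention (assumption): the descriptors act directly as logits.
    weights = softmax(descriptors)
    fused = [sum(w * b[t] for w, b in zip(weights, branches)) for t in range(len(seq))]
    return fused, weights


if __name__ == "__main__":
    fused, weights = multiscale_attention_fusion([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    print("branch weights:", [round(w, 3) for w in weights])
    print("fused sequence:", [round(v, 3) for v in fused])
```

In a trained network, both the kernels and the attention logits would typically be learned. The softmax keeps the branch weights positive and summing to one, so the fused output stays on the same scale as each branch.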

Research Motivation and Goals

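Improve skeleton-based action recognition by balancing the aggregation of temporal and spatial features.
Develop a model that learns spatial and temporal topologies dynamically.
Integrate temporal aggregation, channel-wise topology refinement, and attention-based multi-scale feature fusion.
Provide a dynamic fusion mechanism for multiple data streams to maximize performance across datasets.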

Proposed Method

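Introduces Temporal-Channel Aggregation Graph Convolutional Networks (TCA-GCN), which learn spatial and temporal topologies dynamically.
Proposes Temporal Aggregation (TA), which calibrates temporal weights from the input features.
Proposes Channel Aggregation, which fuses the dynamically learned channel-wise topology with the temporal topology.
Combines Channel-wise Topology Modeling (S, Q) and Temporal Aggregation (TA) in the TCA block.
Adds a TF module for attention-based multi-scale skeletal feature fusion (MSCONE and M attention).
Uses Algorithm 1 to fuse the four streams (bone, bone motion, joint, joint motion) dynamically with adaptive weights.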
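The list above names the building blocks of a TCA block but not their equations. The pure-Python sketch below shows one plausible reading, and every detail in it is an assumption:

- Temporal Aggregation is modeled as one sigmoid gate per frame, computed from that frame's mean activation. This is an SE-style calibration over time.
- S is read as a shared skeleton adjacency. Q is read as a channel-specific refinement built from pairwise differences of temporally averaged joint features, in the spirit of CTR-GCN's channel-wise topology refinement.
- Channel Aggregation applies each channel's refined topology to the temporally gated features.

Tensors are nested lists indexed [channel][frame][joint]. The sketch has no learnable weights, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shape(x):
    """Return (C, T, V) for a nested list indexed [channel][frame][joint]."""
    return len(x), len(x[0]), len(x[0][0])


def temporal_aggregation(x):
    """Hypothetical TA: one sigmoid gate per frame from that frame's mean activation."""
    C, T, V = shape(x)
    gates = [
        sigmoid(sum(x[c][t][v] for c in range(C) for v in range(V)) / (C * V))
        for t in range(T)
    ]
    gated = [[[x[c][t][v] * gates[t] for v in range(V)] for t in range(T)] for c in range(C)]
    return gated, gates


def channel_topology(x, shared_adj, alpha=0.5):
    """Hypothetical S + alpha * Q_c: shared adjacency plus a per-channel refinement."""
    C, T, V = shape(x)
    topologies = []
    for c in range(C):
        # Temporal mean of every joint in channel c.
        m = [sum(x[c][t][v] for t in range(T)) / T for v in range(V)]
        # Q_c[i][j] = tanh(m_i - m_j): antisymmetric and zero on the diagonal.
        topologies.append(
            [[shared_adj[i][j] + alpha * math.tanh(m[i] - m[j]) for j in range(V)] for i in range(V)]
        )
    return topologies


def tca_block(x, shared_adj, alpha=0.5):
    """Aggregate temporally gated features over joints with channel-specific topologies."""
    C, T, V = shape(x)
    gated, _ = temporal_aggregation(x)
    topo = channel_topology(x, shared_adj, alpha)
    # out[c][t][i] = sum_j topo[c][i][j] * gated[c][t][j]
    return [
        [[sum(topo[c][i][j] * gated[c][t][j] for j in range(V)) for i in range(V)] for t in range(T)]
        for c in range(C)
    ]


if __name__ == "__main__":
    x = [[[0.5, -0.2, 1.0], [0.1, 0.3, -0.4]], [[1.0, 0.0, 0.2], [-0.3, 0.6, 0.9]]]  # C=2, T=2, V=3
    chain = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]  # toy 3-joint chain with self-loops
    print(tca_block(x, chain))
```

Setting alpha=0 with an identity adjacency reduces the block to pure temporal gating. That case is a quick sanity check for the indexing.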

Experimental Results

Research Questions

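RQ1: Can a temporal-channel adaptive aggregation framework effectively balance temporal and spatial features in skeleton-based action recognition?
RQ2: Does dynamic channel-wise topology refinement combined with temporal aggregation improve recognition accuracy across datasets?
RQ3: How does attention-based multi-scale temporal feature fusion affect action classification performance?
RQ4: Does the dynamic fusion strategy outperform fixed-weight multi-stream fusion on the NTU and NW-UCLA datasets?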

Key Findings

Method	NW-UCLA accuracy (%)
Lie Group (2015)	74.2
HBRNN-L (2015)	78.5
Glimpse Clouds (2018)	87.6
VA-fusion (2018)	88.1
Action Machine (2018)	92.3
AGC-LSTM (2019)	93.3
SGN (2020b)	92.5
Shift-GCN (2020c)	94.6
DC-GCN+ADG (2020a)	95.3
CTR-GCN (2021b)	96.5
Ta-CNN (2022)	96.1
Ta-CNN+ (2022)	97.2
TCA-GCN	96.8
TCA-GCN(4sD)	97.0

Method	NTU-60 X-Sub (%)	NTU-60 X-View (%)	NTU-120 X-Sub (%)	NTU-120 X-Set (%)
ST-LSTM (2016)
ST-GCN (2018a)
RA-GCNv1 (2019)
2s-AGCN (2019)
Shift-GCN (2020c)
MST-G3D (2020b)
MST-GCN (2021)
Skeletal-GNN (2021b)
CTR-GCN (2021b)
Ta-CNN (2022)
Ta-CNN+ (2022)
EfficientGCN-B4 (2022b)
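The NW-UCLA column can be checked with simple arithmetic. The snippet copies five rows from the table and computes percentage-point gaps. The names `nw_ucla` and `gap` are illustrative only and do not come from the paper.

```python
# NW-UCLA accuracies (%) copied from the table above.
nw_ucla = {
    "CTR-GCN (2021b)": 96.5,
    "Ta-CNN (2022)": 96.1,
    "Ta-CNN+ (2022)": 97.2,
    "TCA-GCN": 96.8,
    "TCA-GCN(4sD)": 97.0,
}


def gap(a, b):
    """Percentage-point difference a - b, rounded to the table's one decimal place."""
    return round(nw_ucla[a] - nw_ucla[b], 1)


if __name__ == "__main__":
    for other in ("TCA-GCN", "CTR-GCN (2021b)", "Ta-CNN+ (2022)"):
        # Prints +0.2, +0.5 and -0.2 points respectively.
        print(f"TCA-GCN(4sD) vs {other}: {gap('TCA-GCN(4sD)', other):+.1f} points")
    print("best on NW-UCLA:", max(nw_ucla, key=nw_ucla.get))  # → Ta-CNN+ (2022)
```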

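Achieves state-of-the-art or competitive results on NW-UCLA, NTU RGB+D and NTU RGB+D 120. On NW-UCLA, TCA-GCN(4sD) reaches 97.0%, behind only Ta-CNN+ (97.2%) among the listed methods.
TCA-GCN with four-stream dynamic fusion (4sD) improves accuracy over single-stream and fixed-weight fusion on several benchmarks. On NW-UCLA, it scores 97.0%, compared with 96.8% for TCA-GCN.
Temporal Aggregation calibrates temporal weights from the input features, which improves the modeling of temporal dynamics.
Channel-wise topology modeling learns dynamic spatial topologies. Combined with the spatio-temporal topology, it yields richer representations.
Attention-based multi-scale skeletal feature fusion further strengthens the modeling of action semantics.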
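Algorithm 1 is only named in this review, so the following is a hedged stand-in rather than the paper's procedure. The sketch works in three steps:

- It fuses per-class scores from the four streams with a weighted sum.
- It picks the weights by exhaustive grid search for the highest accuracy on held-out samples.
- It compares the result with equal fixed weights.

The stream scores and labels are made-up toy values. The names `fused_accuracy` and `search_weights` are hypothetical.

```python
from itertools import product

STREAMS = ("joint", "bone", "joint_motion", "bone_motion")


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def fused_accuracy(scores, labels, weights):
    """Accuracy of the weighted sum of per-stream class scores."""
    correct = 0
    for i, label in enumerate(labels):
        n_classes = len(scores[STREAMS[0]][i])
        fused = [sum(w * scores[s][i][k] for w, s in zip(weights, STREAMS)) for k in range(n_classes)]
        correct += argmax(fused) == label
    return correct / len(labels)


def search_weights(scores, labels, grid=(0.0, 0.5, 1.0)):
    """Hypothetical stand-in for adaptive weighting: exhaustive grid search."""
    best_weights, best_acc = None, -1.0
    for weights in product(grid, repeat=len(STREAMS)):
        if not any(weights):
            continue  # skip the all-zero combination
        acc = fused_accuracy(scores, labels, weights)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Toy held-out scores: scores[stream][sample] = [class-0 score, class-1 score].
SCORES = {
    "joint": [[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]],
    "bone": [[0.1, 0.9], [0.4, 0.6], [0.6, 0.4]],
    "joint_motion": [[0.55, 0.45], [0.45, 0.55], [0.6, 0.4]],
    "bone_motion": [[0.2, 0.8], [0.5, 0.5], [0.5, 0.5]],
}
LABELS = [0, 1, 0]

if __name__ == "__main__":
    print("equal weights:", fused_accuracy(SCORES, LABELS, (1.0, 1.0, 1.0, 1.0)))
    print("searched:", search_weights(SCORES, LABELS))
```

The grid contains the equal-weight setting (1, 1, 1, 1). As a result, the searched weights can never score lower than equal weights on the samples used for the search. To avoid overfitting, those samples should come from a validation split rather than the test set.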

This review was generated by AI and reviewed by human editors.