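[Paper Explained] An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition
This paper proposes AGC-LSTM, a graph convolutional LSTM network with attention for skeleton-based action recognition. It captures spatial and temporal features together with their co-occurrence, uses a temporal hierarchical architecture to enlarge the temporal receptive field and reduce computation, and achieves state-of-the-art results on the NTU RGB+D and Northwestern-UCLA datasets.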
Skeleton-based action recognition is an important task that requires an adequate understanding of movement characteristics of a human action from the given skeleton sequence. Recent studies have shown that exploring spatial and temporal features of the skeleton sequence is vital for this task. Nevertheless, how to effectively extract discriminative spatial and temporal features is still a challenging problem. In this paper, we propose a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between spatial and temporal domains. We also present a temporal hierarchical architecture to increase the temporal receptive fields of the top AGC-LSTM layer, which boosts the ability to learn the high-level semantic representation and significantly reduces the computation cost. Furthermore, to select discriminative spatial information, the attention mechanism is employed to enhance information of key joints in each AGC-LSTM layer. Experimental results on two datasets are provided: the NTU RGB+D dataset and the Northwestern-UCLA dataset. The comparison results demonstrate the effectiveness of our approach and show that our approach outperforms the state-of-the-art methods on both datasets.
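Research Motivation and Goals
- Advance robust skeleton-based action recognition by exploiting both spatial configuration and temporal dynamics.
- Propose a unified model that captures the co-occurrence between the spatial and temporal domains.
- Introduce an attention mechanism that highlights discriminative joints at each time step.
- Introduce a temporal hierarchical architecture that enlarges the temporal receptive field and reduces computation.
- Demonstrate state-of-the-art performance on standard benchmarks (NTU RGB+D and Northwestern-UCLA).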
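Proposed Method
- Map the 3D joint coordinates to a spatial feature for each joint with a linear layer.
- Obtain augmented features by concatenating the joint position features with frame-difference features and passing them through a shared LSTM that normalizes their scale.
- Capture spatio-temporal patterns with three stacked AGC-LSTM layers built on graph convolution.
- Apply an attention network that highlights key joints at each time step and mixes attended and unattended features.
- Insert temporal average pooling to create a temporal hierarchy, which enlarges the receptive field and reduces computation.
- Fuse the global (all joints) and local (attended joints) features from the last AGC-LSTM layer for classification.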
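To make the graph convolution and attention steps above more concrete, here is a minimal pure-Python sketch for a single time step. It is not the authors' implementation: the row-normalized adjacency with self-loops, the single-vector sigmoid scoring, the `(1 + alpha) * h` mixing rule, and every function name are assumptions chosen for illustration. In a GC-LSTM-style cell, an operator like `graph_conv` would stand in for the dense products inside the gates; the recurrence itself is omitted.

```python
import math

def normalize_adjacency(edges, num_joints):
    """Row-normalize (A + I) so each joint averages over itself and its neighbours.
    Assumption: a simple normalization, not the paper's exact graph operator."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1.0
    return [[v / sum(row) for v in row] for row in adj]

def graph_conv(adj, feats, weight):
    """One graph convolution: aggregate neighbour features, then project with `weight`."""
    num_joints, in_dim, out_dim = len(feats), len(weight), len(weight[0])
    agg = [[sum(adj[i][k] * feats[k][d] for k in range(num_joints))
            for d in range(in_dim)] for i in range(num_joints)]
    return [[sum(agg[i][d] * weight[d][o] for d in range(in_dim))
             for o in range(out_dim)] for i in range(num_joints)]

def attend(hidden, score_weight):
    """Score each joint with a sigmoid gate, then mix attended and unattended
    features as (1 + alpha) * h. Both choices are illustrative assumptions."""
    alphas = [1.0 / (1.0 + math.exp(-sum(h * w for h, w in zip(joint, score_weight))))
              for joint in hidden]
    mixed = [[(1.0 + a) * h for h in joint] for a, joint in zip(alphas, hidden)]
    return alphas, mixed

# Toy skeleton: 3 joints in a chain (0-1-2), 2 features per joint.
edges = [(0, 1), (1, 2)]
adj = normalize_adjacency(edges, 3)
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical projection weights
hidden = graph_conv(adj, feats, identity)    # neighbour-averaged joint features
alphas, mixed = attend(hidden, [0.5, -0.5])  # hypothetical scoring vector
```

With the additive mix, a joint with a low attention score keeps its original feature instead of being zeroed out, which is one plausible reading of mixing attended and unattended features.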
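Experimental Results
Research Questions
- RQ1: How can graph-based methods effectively extract discriminative spatio-temporal features from skeleton sequences?
- RQ2: Does adding an attention mechanism over joints improve the discriminative power of action-relevant spatio-temporal configurations?
- RQ3: Does the temporal hierarchical architecture improve high-level spatio-temporal representations while reducing computation?
- RQ4: How do joint-level modeling, part-level modeling, and their combination perform in skeleton-based action recognition?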
Key Findings
| Method | CV (%) | CS (%) |
|---|---|---|
| HBRNN-L | 64.0 | 59.1 |
| Part-aware LSTM | 70.3 | 62.9 |
| Trust Gate ST-LSTM | 77.7 | 69.2 |
| Two-stream RNN | 79.5 | 71.3 |
| STA-LSTM | 81.2 | 73.4 |
| Ensemble TS-LSTM | 81.3 | 74.6 |
| Visualization CNN | 82.6 | 76.0 |
| VA-LSTM | 87.6 | 79.4 |
| ST-GCN | 88.3 | 81.5 |
| SR-TSL | 92.4 | 84.8 |
| HCN | 91.1 | 86.5 |
| PB-GCN | 93.2 | 87.5 |
| AGC-LSTM (Joint) | 93.5 | 87.5 |
| AGC-LSTM (Part) | 93.8 | 87.5 |
| AGC-LSTM (Joint&Part) | 95.0 | 89.2 |
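
As a quick check on how large the gain is, the short sketch below computes the margin of AGC-LSTM (Joint&Part) over the best prior entry in each column. The numbers are copied from the table above; only the three strongest prior methods are included, and the variable names are illustrative.

```python
# Accuracies (%) copied from the table above, stored as (CV, CS).
prior_methods = {
    "SR-TSL": (92.4, 84.8),
    "HCN": (91.1, 86.5),
    "PB-GCN": (93.2, 87.5),
}
agc_lstm = (95.0, 89.2)  # AGC-LSTM (Joint&Part)

# Best prior accuracy per column, taken independently for CV and CS.
best_cv = max(cv for cv, _ in prior_methods.values())
best_cs = max(cs for _, cs in prior_methods.values())

# Round to one decimal to match the table's precision.
margin_cv = round(agc_lstm[0] - best_cv, 1)
margin_cs = round(agc_lstm[1] - best_cs, 1)
print(f"CV margin: +{margin_cv}, CS margin: +{margin_cs}")
# prints "CV margin: +1.8, CS margin: +1.7"
```
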
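- AGC-LSTM with attention achieves state-of-the-art accuracy on NTU RGB+D (Joint / Part / Joint&Part: 93.5 / 93.8 / 95.0 CV and 87.5 / 87.5 / 89.2 CS) and on Northwestern-UCLA (Joint / Part / Joint&Part: 93.3/?/? in the listed table).
- Both the joint-level and part-level variants reach top-tier performance, and the Joint&Part fusion gives the best result on NTU RGB+D.
- In the ablation study, replacing LSTM with GC-LSTM and adding the temporal hierarchy (TH) markedly improve accuracy (e.g., GC-LSTM+TH over GC-LSTM, and AGC-LSTM over GC-LSTM).
- Attention progressively emphasizes key joints (e.g., elbows, wrists, hands) from layer to layer, as the attention visualizations show.
- The temporal hierarchical architecture enlarges the temporal receptive field and reduces computation without sacrificing accuracy.
- Hybrid Joint&Part modeling yields a further gain over either single-branch variant.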
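The effect of the temporal hierarchy on sequence length and receptive field can be sketched with plain temporal average pooling. This is an illustration under stated assumptions: the kernel size and stride of 2 are hypothetical (the summary does not give the paper's pooling settings), and `temporal_avg_pool` is an illustrative helper, not a function from the authors' code.

```python
def temporal_avg_pool(seq, kernel=2, stride=2):
    """Average groups of consecutive frames.
    `seq` is a list of per-frame feature vectors; kernel/stride are assumed values."""
    pooled = []
    for start in range(0, len(seq) - kernel + 1, stride):
        window = seq[start:start + kernel]
        # zip(*window) groups the same feature dimension across the frames.
        pooled.append([sum(vals) / kernel for vals in zip(*window)])
    return pooled

frames = [[float(t)] for t in range(8)]  # 8 frames, 1 feature each
level1 = temporal_avg_pool(frames)       # 4 steps, each covering 2 frames
level2 = temporal_avg_pool(level1)       # 2 steps, each covering 4 frames
print(len(frames), len(level1), len(level2))
# prints "8 4 2"
```

Under these assumed settings, each pooling stage halves the number of time steps the next recurrent layer has to process, while each remaining step summarizes twice as many original frames. That is the sense in which a hierarchy of this kind can widen the receptive field and cut computation at the same time.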
This explainer was generated by AI and reviewed by a human editor.