QUICK REVIEW

[Paper Review] Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

Fabien Baradel, Christian Wolf | arXiv (Cornell University) | Mar 29, 2017

Human Pose and Action Recognition | References: 51 | Cited by: 73

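One-Sentence Summary

This paper fuses a pose-based CNN with a pose-conditioned spatio-temporal attention mechanism over RGB video, achieves state-of-the-art results on NTU-RGB+D and SBU Kinect Interaction, and transfers the learned knowledge to smaller datasets.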

ABSTRACT

We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features on a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. Finally a temporal attention mechanism learns how to fuse LSTM features over time. We evaluate the method on 3 datasets. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D, as well as on the SBU Kinect Interaction dataset. Performance close to state-of-the-art is achieved on the smaller MSR Daily Activity 3D dataset.
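The pose input described in the abstract (a 3D tensor over a sub-sequence, with joints ordered to respect body topology) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SKELETON` tree, the joint names, the depth-first traversal, and the helper names `topological_order` and `build_pose_tensor` are all hypothetical. The abstract states only that the ordering respects the topology of the human body.

```python
# Toy skeleton as parent -> children. Joint names are illustrative only,
# not the joint set of any dataset named in this review.
SKELETON = {
    "spine": ["neck", "l_hip", "r_hip"],
    "neck": ["head", "l_shoulder", "r_shoulder"],
    "l_shoulder": ["l_elbow"],
    "l_elbow": ["l_hand"],
    "r_shoulder": ["r_elbow"],
    "r_elbow": ["r_hand"],
    "l_hip": ["l_knee"],
    "r_hip": ["r_knee"],
}


def topological_order(tree, root):
    """Depth-first traversal, so the joints of each limb stay contiguous.

    Assumption: a DFS is one simple way to get a neighborhood-preserving
    ordering; the ordering used in the paper may differ.
    """
    order = []
    stack = [root]
    while stack:
        joint = stack.pop()
        order.append(joint)
        # reversed() makes children come off the stack in their listed order.
        stack.extend(reversed(tree.get(joint, [])))
    return order


def build_pose_tensor(frames, order):
    """Stack per-frame joint coordinates into a nested list shaped [T][J][3].

    frames: list of dicts mapping joint name -> (x, y, z).
    order:  joint names in the order they should appear along axis J.
    """
    return [[list(frame[joint]) for joint in order] for frame in frames]


order = topological_order(SKELETON, "spine")

# Four synthetic frames with made-up coordinates, just to exercise the code.
frames = [
    {joint: (float(t), float(i), 0.0) for i, joint in enumerate(sorted(order))}
    for t in range(4)
]
tensor = build_pose_tensor(frames, order)
```

With this ordering, a convolution sliding along the joint axis first sees joints from the same limb, which is the intuition behind the abstract's claim that the layers then correspond to meaningful levels of abstraction.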

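Motivation and Objectives

Improve action recognition accuracy by using multi-modal data (articulated pose and RGB frames).
Develop a pose-aware two-stream architecture in which pose guides attention over the RGB frames.
Encode pose information into a convolutional representation that captures change over time.
Provide adaptive temporal pooling to fuse information across time effectively.
Show that representations learned on a large dataset transfer to smaller benchmark datasets.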

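Proposed Method

Encode pose data as a 3D tensor whose joints follow an ordering that reflects body topology, for processing by a CNN.
Use a pose-only CNN to extract hierarchical pose features from pose sub-sequences.
Apply a spatial attention mechanism on the RGB frames, conditioned on pose features, using a trainable glimpse sensor focused on the four hands.
Process the glimpse outputs with an LSTM; at each time step, fuse the per-hand features using attention weights.
Apply a temporal attention mechanism to adaptively pool the LSTM features over time.
Fuse the pose and RGB streams at the logit level and train end to end (with some components frozen while the RGB stream is trained).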
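The attention and fusion steps listed above can be sketched with the standard library alone. This is a rough illustration, not the paper's architecture: the linear scoring, the weight values `w_pose` and `w_time`, the function names, and the use of plain averaging for logit fusion are all assumptions, and the LSTM is left out entirely (the per-step vectors simply stand in for LSTM outputs).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine equally sized vectors using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def spatial_attention(glimpses, pose_feat, w_pose):
    """Score each hand glimpse from the pose features, then blend the glimpses.

    glimpses:  four feature vectors, one per hand.
    pose_feat: feature vector from the pose stream.
    w_pose:    hypothetical 4 x len(pose_feat) matrix standing in for a
               learned scoring layer.
    """
    weights = softmax([dot(row, pose_feat) for row in w_pose])
    return weights, weighted_sum(weights, glimpses)


def temporal_attention(step_features, w_time):
    """Pool per-step features with softmax weights over time."""
    weights = softmax([dot(w_time, h) for h in step_features])
    return weights, weighted_sum(weights, step_features)


def fuse_logits(pose_logits, rgb_logits):
    """Late fusion at the logit level; plain averaging is an assumption."""
    return [(p + r) / 2.0 for p, r in zip(pose_logits, rgb_logits)]


# Toy values, chosen only to exercise the functions.
glimpses = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
pose_feat = [0.5, -0.5, 1.0]
w_pose = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
hand_weights, attended = spatial_attention(glimpses, pose_feat, w_pose)

# The attended glimpse plus two made-up vectors stand in for LSTM outputs.
steps = [attended, [0.2, 0.4], [0.9, 0.1]]
time_weights, pooled = temporal_attention(steps, [1.0, -1.0])

fused = fuse_logits([2.0, 0.0, -1.0], [0.0, 1.0, -1.0])
```

Both attention steps have the same shape: scores are normalized with a softmax and used to form a weighted sum, over hands in the spatial case and over time steps in the temporal case. In the toy run, the third hand receives the largest spatial weight because its score is highest.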

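Experimental Results

Research Questions

RQ1: Does combining pose-driven features with RGB video through attention mechanisms improve action recognition performance?
RQ2: Does conditioning RGB spatial attention on pose features improve the model's focus on informative regions (such as hands and manipulated objects)?
RQ3: Does temporal attention fuse features over time effectively enough to improve accuracy?
RQ4: Does knowledge transfer from a large dataset (NTU) benefit smaller datasets (MSR Daily Activity 3D, SBU Kinect Interaction)?
RQ5: How do joint ordering and pose representation affect recognition performance?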

Key Findings

Method	CS	CV	Average
Lie Group	50.1	52.8	51.5
Skeleton Quads	38.6	41.4	40.0
Dynamic Skeletons	60.2	65.2	62.7
HBRNN	59.1	64.0	61.6
Deep LSTM	60.7	67.3	64.0
Part-aware LSTM	62.9	70.3	66.6
ST-LSTM + TrustG.	69.2	77.7	73.5
STA-LSTM	73.4	81.2	77.3
JTM	76.3	81.1	78.7
DSSCA - SSLM	74.9	-	-
Ours (pose only)	90.5	-	-
Ours (RGB only)	72.0	-	-
Ours (pose + RGB)	94.1	-	-

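On NTU RGB+D, both the pose-only model and the pose + RGB model achieve state-of-the-art results.
The full model achieves state-of-the-art results on the SBU Kinect Interaction dataset.
Performance on MSR Daily Activity 3D is competitive, reflecting the difficulty of very small datasets.
A topological, neighborhood-preserving joint ordering improves on random ordering by more than 1 percentage point on NTU.
Pose-conditioned spatial attention substantially improves performance, with larger gains in the RGB-only setting (roughly 1–12 points) than in the multi-modal setting.
Transfer from NTU to MSR and SBU yields meaningful gains on the smaller datasets.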

This review was generated by AI and checked by human editors.