QUICK REVIEW

[Paper Review] Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition

Helei Qiu, Biao Hou | arXiv (Cornell University) | Jan 8, 2022

Human Pose and Action Recognition | Cited by 38

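One-Line Summary

This paper proposes STTFormer, a Transformer-based model that encodes short spatio-temporal tuples to capture correlations between joints across consecutive frames, and uses an inter-frame feature aggregation module to help distinguish similar actions. It reports state-of-the-art results on the NTU RGB+D and NTU RGB+D 120 datasets.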

ABSTRACT

Capturing the dependencies between joints is critical in skeleton-based action recognition. The Transformer shows great potential for modeling the correlation of important joints. However, existing Transformer-based methods cannot capture the correlation of different joints between frames, which is very useful since different body parts (such as the arms and legs in "long jump") move together between adjacent frames. To address this problem, a novel spatio-temporal tuples Transformer (STTFormer) method is proposed. The skeleton sequence is divided into several parts, and the consecutive frames contained in each part are encoded. A spatio-temporal tuples self-attention module is then proposed to capture the relationship of different joints in consecutive frames. In addition, a feature aggregation module is introduced between non-adjacent frames to enhance the ability to distinguish similar actions. Compared with state-of-the-art methods, our method achieves better performance on two large-scale datasets.

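Motivation and Objectives

Address a limitation of existing Transformer-based methods for skeleton data: they cannot model correlations between different joints across consecutive frames.
Propose a spatio-temporal tuple encoding strategy that flattens and encodes consecutive frames.
Develop STTFormer, built on spatio-temporal tuples attention (STTA) and inter-frame feature aggregation (IFFA).
Incorporate positional encoding and multi-mode data to improve recognition accuracy.
Evaluate on the large-scale skeleton action datasets NTU RGB+D and NTU RGB+D 120, with ablations to validate each component.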
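
To make the tuple idea concrete, here is a minimal sketch of the partitioning step only: a sequence of T frames is cut into non-overlapping groups of n consecutive frames, and each group is flattened into one list of n x V joint tokens. The function name `split_into_tuples`, the nested-list layout, and the toy sizes (120 frames, 25 joints, n = 6) are assumptions for illustration; the learned convolutional encoding that the paper applies afterwards is not shown.

```python
def split_into_tuples(frames, n):
    """Split T frames into non-overlapping tuples of n consecutive frames.

    frames: list of T frames, each a list of V joint coordinates.
    Returns a list of T // n tuples; each tuple is a flat list of n * V joints.
    """
    if n <= 0 or len(frames) % n != 0:
        raise ValueError("sequence length must be a positive multiple of n")
    tuples = []
    for start in range(0, len(frames), n):
        # Flatten n consecutive frames into a single list of joint tokens.
        flat = [joint for frame in frames[start:start + n] for joint in frame]
        tuples.append(flat)
    return tuples


# Toy input (assumed sizes): 120 frames, 25 joints, 3-D coordinates, all zeros.
frames = [[(0.0, 0.0, 0.0) for _ in range(25)] for _ in range(120)]
tuples = split_into_tuples(frames, n=6)
print(len(tuples), len(tuples[0]))  # 20 150
```

Because every token in a tuple comes from one of n neighbouring frames, attention applied inside a tuple can relate a joint in one frame to a different joint in the next, which is the cross-frame correlation the first objective refers to.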

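Proposed Method

Spatio-temporal tuples encoding: the skeleton sequence is split into non-overlapping parts (tuples); each tuple is flattened across its consecutive frames and encoded with a convolutional layer.
Spatio-temporal tuples transformer (STTFormer): the STTA module models relationships between joints within a tuple, using multi-head self-attention with a 1x1 convolution followed by a 1xk1 convolution.
Inter-frame feature aggregation (IFFA): a k2 x 1 temporal convolution aggregates sub-actions across tuples.
Positional encoding: sinusoidal encoding distinguishes the joints and frames within a tuple.
Multi-mode data fusion: the joint, bone, and joint-motion modalities are fused for the final prediction.
Training: end-to-end with SGD and cross-entropy loss; sequences are padded to 120 frames.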
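
As a rough illustration of how positional encoding and attention interact inside one tuple, the sketch below adds the standard sinusoidal encoding to a handful of toy tokens and runs plain single-head scaled dot-product self-attention over them. This is not the paper's STTA module: the learned 1x1 and 1xk1 convolutions, the multiple heads, and the exact way positions are indexed are all omitted or assumed, and the names `sinusoidal_encoding` and `self_attention` are hypothetical.

```python
import math


def sinusoidal_encoding(num_positions, dim):
    """Standard sinusoidal encoding: sin on even channels, cos on odd channels."""
    table = []
    for pos in range(num_positions):
        row = []
        for c in range(dim):
            # Channels 2i and 2i+1 share the frequency 1 / 10000^(2i / dim).
            angle = pos / (10000 ** ((c - c % 2) / dim))
            row.append(math.sin(angle) if c % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


# One toy tuple (assumed sizes): n = 2 frames x V = 3 joints = 6 tokens, 4 channels.
n, V, dim = 2, 3, 4
features = [[0.1 * (t + c) for c in range(dim)] for t in range(n * V)]
pe = sinusoidal_encoding(n * V, dim)
tokens = [[f + p for f, p in zip(frow, prow)] for frow, prow in zip(features, pe)]
attended = self_attention(tokens)
print(len(attended), len(attended[0]))  # 6 4
```

Without the added encoding, two tokens with identical features would be indistinguishable to the attention step regardless of which joint or frame they came from, which is the role the positional encoding item describes.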

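Experimental Results

Research Questions

RQ1: Does modeling correlations between different joints across consecutive frames improve skeleton-based action recognition?
RQ2: Does spatio-temporal tuple encoding capture cross-frame joint relationships while reducing computational cost?
RQ3: Is inter-frame aggregation effective at combining sub-actions to distinguish similar actions?
RQ4: How do positional encoding and multi-mode data affect STTFormer's performance?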
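
The cost side of RQ2 can be sanity-checked with a back-of-the-envelope count of query-key pairs, assuming the comparison is against attention over every joint in every frame at once. The numbers below are not measurements from the paper; the helper `attention_pairs` and the sizes T = 120, V = 25, n = 6 are illustrative assumptions.

```python
def attention_pairs(num_tokens, num_groups=1):
    """Number of query-key pairs when attention runs independently in each group."""
    return num_groups * num_tokens ** 2


T, V, n = 120, 25, 6  # frames, joints, frames per tuple (illustrative values)

full = attention_pairs(T * V)                # all joints in all frames attend to each other
tuple_wise = attention_pairs(n * V, T // n)  # attention restricted to each tuple
frame_wise = attention_pairs(V, T)           # spatial attention inside single frames only

print(full, tuple_wise, frame_wise)  # 9000000 450000 75000
print(full // tuple_wise)            # 20, i.e. T / n
```

Under these assumptions, tuple-wise attention needs T / n times fewer pairs than full spatio-temporal attention, while still covering cross-frame joint pairs that frame-wise attention never sees.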

Key Findings

Method	NTU RGB+D X-Sub (%)	NTU RGB+D X-View (%)	NTU RGB+D 120 X-Sub (%)	NTU RGB+D 120 X-Set (%)
MTCNN	81.1	87.4	61.2	63.3
IndRNN	81.8	88.0	-	-
HCN	86.5	91.1	-	-
ST-GCN	81.5	88.3	-	-
2s-AGCN	88.5	95.1	82.9	84.9
DGNN	89.9	96.1	-	-
Shift-GCN	90.7	96.5	85.9	87.6
Dynamic-GCN	91.5	96.0	85.9	87.6
MS-G3D	91.5	96.2	86.9	88.4
MST-GCN	91.5	96.6	87.5	88.8
ST-TR	89.9	96.1	-	-
DSTA-Net	91.5	96.4	86.6	89.0
STTFormer(Ours)	92.3	96.5	88.3	89.2

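STTFormer reports state-of-the-art results on the NTU RGB+D and NTU RGB+D 120 skeleton benchmarks: 92.3% X-Sub and 96.5% X-View on NTU RGB+D, and 88.3% X-Sub and 89.2% X-Set on NTU RGB+D 120. It leads the table on three of the four splits; on X-View it is 0.1 points below MST-GCN (96.6%).
In the ablation, removing positional encoding (PE) lowers accuracy (without PE: 89.3% X-Sub, 91.8% X-View; with PE: 89.9% X-Sub, 94.3% X-View).
Removing inter-frame feature aggregation (IFFA) lowers performance substantially (without IFFA: 84.5% X-Sub, 88.1% X-View).
Using n=6 frames per tuple gives the best results (n=1: 82.9% X-Sub, 86.0% X-View; n=6: 86.2% X-Sub, 91.3% X-View).
Multi-mode data fusion (joint, bone, joint motion) improves accuracy over any single mode (fused: 92.3% X-Sub, 96.5% X-View on NTU RGB+D; 88.3% X-Sub, 89.2% X-Set on NTU RGB+D 120).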
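
The review does not say how the three modalities are combined, so the sketch below assumes a simple late fusion in which each stream outputs per-class scores and the scores are summed with equal weights. The function `fuse_scores`, the stream names, and the probabilities are made up for illustration.

```python
def fuse_scores(stream_scores, weights=None):
    """Weighted sum of per-class scores from several streams; returns (label, fused)."""
    names = list(stream_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weighting is an assumption
    num_classes = len(stream_scores[names[0]])
    fused = [
        sum(weights[name] * stream_scores[name][c] for name in names)
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=fused.__getitem__)
    return label, fused


# Made-up class probabilities for one clip and three classes.
scores = {
    "joint":        [0.50, 0.40, 0.10],
    "bone":         [0.30, 0.60, 0.10],
    "joint_motion": [0.35, 0.45, 0.20],
}
label, fused = fuse_scores(scores)
print(label)  # 1
```

In this toy case the joint stream alone would pick class 0, while the fused scores pick class 1, which illustrates how complementary streams can change a prediction.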


This review was generated by AI and checked by a human editor.