QUICK REVIEW

[Paper Review] Hypergraph Transformer for Skeleton-based Action Recognition

Yuxuan Zhou, Zhi-Qi Cheng|arXiv (Cornell University)|Nov 17, 2022

Human Pose and Action Recognition|Cited by 30

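ONE-SENTENCE SUMMARY

Hyperformer uses Hypergraph Self-Attention (HyperSA) and a relative positional encoding based on graph distance to model higher-order joint relations in skeleton data, achieving state-of-the-art results on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.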

ABSTRACT

Skeleton-based action recognition aims to recognize human actions given human joint coordinates with skeletal interconnections. By defining a graph with joints as vertices and their natural connections as edges, previous works successfully adopted Graph Convolutional networks (GCNs) to model joint co-occurrences and achieved superior performance. More recently, a limitation of GCNs is identified, i.e., the topology is fixed after training. To relax such a restriction, Self-Attention (SA) mechanism has been adopted to make the topology of GCNs adaptive to the input, resulting in the state-of-the-art hybrid models. Concurrently, attempts with plain Transformers have also been made, but they still lag behind state-of-the-art GCN-based methods due to the lack of structural prior. Unlike hybrid models, we propose a more elegant solution to incorporate the bone connectivity into Transformer via a graph distance embedding. Our embedding retains the information of skeletal structure during training, whereas GCNs merely use it for initialization. More importantly, we reveal an underlying issue of graph models in general, i.e., pairwise aggregation essentially ignores the high-order kinematic dependencies between body joints. To fill this gap, we propose a new self-attention (SA) mechanism on hypergraph, termed Hypergraph Self-Attention (HyperSA), to incorporate intrinsic higher-order relations into the model. We name the resulting model Hyperformer, and it beats state-of-the-art graph models w.r.t. accuracy and efficiency on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.

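Motivation and Objectives

Incorporate skeletal structure information into a Transformer model via a relative positional embedding based on graph distance.
Introduce Hypergraph Self-Attention (HyperSA) to capture higher-order relations among joints.
Build a lightweight Transformer architecture that is competitive with graph-based methods in both accuracy and efficiency.
Show that automatically learning how the joints of the skeleton are grouped leads to a better encoding.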

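Proposed Method

Propose a relative positional encoding based on graph distance so that the skeletal structure remains encoded throughout training.
Develop Hypergraph Self-Attention (HyperSA), which captures both pairwise and higher-order joint relations through hyperedges.
Compute hyperedge representations from joint features using the incidence and degree matrices, with a learnable projection applied.
Introduce a learnable partitioning strategy for grouping joints, using a softmax relaxation to enable end-to-end training.
Remove the MLP layers from the Transformer to obtain a lightweight model, and use Multi-Scale Temporal Convolution (MS-TC) for temporal modeling.
Build the Hyperformer architecture by alternately stacking HyperSA and temporal convolution modules.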
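
The two structural ingredients listed above, graph distance between joints and hyperedge features built from an incidence matrix, can be illustrated with a small sketch. This is not the authors' implementation: the five-joint skeleton, the hard (one-hot) partition, and the function names `hop_distances` and `hyperedge_means` are hypothetical, the learnable projection is omitted, and degree normalization is assumed to mean averaging the joints that belong to each hyperedge.

```python
from collections import deque


def hop_distances(num_joints, bones):
    """All-pairs shortest-path (hop) distances on an undirected skeleton graph."""
    adjacency = [[] for _ in range(num_joints)]
    for a, b in bones:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = [[-1] * num_joints for _ in range(num_joints)]
    for src in range(num_joints):
        # Breadth-first search from each joint; -1 marks "not reached yet".
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def hyperedge_means(features, partition, num_groups):
    """Average joint features inside each hyperedge.

    partition[j] is the hyperedge index of joint j, which is equivalent to a
    one-hot incidence matrix H; the result corresponds to D_e^{-1} H^T X.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for joint, group in enumerate(partition):
        counts[group] += 1
        for d in range(dim):
            sums[group][d] += features[joint][d]
    # max(count, 1) avoids division by zero for an empty hyperedge.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


# Hypothetical toy skeleton: joint 0 is a hub, joints 2-3 form a two-bone limb.
BONES = [(0, 1), (0, 2), (2, 3), (0, 4)]
hops = hop_distances(5, BONES)
print(hops[1][3])  # 3 hops: 1 -> 0 -> 2 -> 3

# Hypothetical partition into three hyperedges: {0, 1}, {2, 3}, {4}.
FEATURES = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0], [5.0, 5.0]]
print(hyperedge_means(FEATURES, [0, 0, 1, 1, 2], 3))
# [[2.0, 1.0], [1.0, 2.0], [5.0, 5.0]]
```

In a model of this kind, the hop distances would typically index a learned embedding table that biases the attention scores, and the hard partition would be replaced by softmax-normalized assignment weights so that the grouping can be trained end to end; both of those steps are left out of the sketch.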

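Experimental Results

Research Questions

RQ1: How can skeletal connectivity be integrated into a Transformer model to improve action recognition?
RQ2: Do higher-order joint relations modeled by a hypergraph yield performance gains beyond pairwise interactions?
RQ3: Can a relative positional encoding based on graph distance improve the performance of Transformers on skeleton data?
RQ4: Is a lightweight Transformer with HyperSA competitive with state-of-the-art graph-based methods in accuracy and efficiency?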

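Key Findings

Hyperformer achieved state-of-the-art performance on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA benchmarks.
HyperSA substantially improved accuracy over the vanilla Transformer baseline, with further gains from incorporating hyperedge relations and the k-Hop relative positional encoding (RPE).
The learned partitioning of joints into groups yields better results than fixed, empirically chosen partitions.
Removing the MLP layers and using MS-TC for temporal modeling yields a lightweight model with competitive or better accuracy.
The model outperforms many graph-based methods while remaining efficient in parameter count and FLOPs.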

This review was generated by AI and checked by a human editor.