QUICK REVIEW

[Paper Review] Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

Fabien Baradel, Christian Wolf|arXiv (Cornell University)|Mar 29, 2017

Human Pose and Action Recognition | References: 51 | Citations: 73

ONE-LINE SUMMARY

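This paper combines a pose-based CNN with a pose-conditioned spatio-temporal attention mechanism over RGB video, achieves state-of-the-art results on NTU-RGB+D and SBU Kinect Interaction, and transfers the learned knowledge to smaller datasets.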

ABSTRACT

We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features on a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. Finally a temporal attention mechanism learns how to fuse LSTM features over time. We evaluate the method on 3 datasets. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D, as well as on the SBU Kinect Interaction dataset. Performance close to state-of-the-art is achieved on the smaller MSR Daily Activity 3D dataset.

MOTIVATION AND OBJECTIVES

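Recognize human actions accurately from multi-modal data (articulated pose and RGB frames).
Develop a pose-aware two-stream architecture in which pose guides the attention mechanism over RGB.
Encode pose information over time into a convolutional representation.
Enable adaptive temporal pooling so that information is fused effectively over time.
Demonstrate that representations learned on a large-scale dataset transfer to small benchmarks.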

PROPOSED METHOD

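Encode pose data as a 3D tensor arranged along a topology-based joint ordering, ready for CNN processing.
Use a pose-only CNN to extract hierarchical pose features from pose sub-sequences.
Implement a spatial attention mechanism over RGB frames, conditioned on pose features, using a trainable glimpse sensor that focuses on the four hands.
Process the glimpse outputs with an LSTM, integrating attention-based fusion across the hands at each time step.
Apply a temporal attention mechanism to adaptively pool the LSTM features over time.
Fuse the pose and RGB streams at the logit level and train end to end (with some components frozen while the RGB stream is trained).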
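
The attention and fusion steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear scoring layers `W` and `v`, the random inputs, and the use of a plain sum for logit fusion are all hypothetical, and the LSTM is omitted so that the hand-fused features are pooled directly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(glimpses, pose_feat, W):
    """Soft attention over the 4 hand glimpses, conditioned on pose features.

    glimpses:  (T, 4, D) appearance features, one per hand and time step
    pose_feat: (T, P) features from the pose stream
    W:         (P, 4) hypothetical linear layer mapping pose features to hand scores
    """
    weights = softmax(pose_feat @ W, axis=-1)              # (T, 4), rows sum to 1
    fused = (weights[:, :, None] * glimpses).sum(axis=1)   # (T, D)
    return fused, weights

def temporal_attention_pool(h, v):
    """Attention-weighted pooling over time.

    h: (T, H) per-step features (LSTM outputs in the described model)
    v: (H,) hypothetical scoring vector
    """
    weights = softmax(h @ v, axis=0)    # (T,), sums to 1
    return weights @ h, weights         # pooled feature (H,), weights (T,)

def fuse_logits(pose_logits, rgb_logits):
    # Assumption: logit-level fusion as a simple sum of the two streams.
    return pose_logits + rgb_logits

# Toy run with random data: T steps, D-dim glimpses, P-dim pose features, C classes.
rng = np.random.default_rng(0)
T, D, P, C = 8, 16, 10, 5
glimpses = rng.normal(size=(T, 4, D))
pose_feat = rng.normal(size=(T, P))
fused, hand_w = spatial_attention(glimpses, pose_feat, rng.normal(size=(P, 4)))
pooled, time_w = temporal_attention_pool(fused, rng.normal(size=D))
rgb_logits = pooled @ rng.normal(size=(D, C))
scores = fuse_logits(rng.normal(size=C), rgb_logits)
```

Because both attention steps use a softmax, `hand_w` shows how much each hand contributes at each time step and `time_w` shows how much each time step contributes to the pooled feature, which makes the weights directly inspectable.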

EXPERIMENTAL RESULTS

Research questions

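RQ1: Can an attention mechanism that combines pose-driven features with RGB video improve action recognition?
RQ2: Does conditioning spatial attention over RGB on pose features improve the model's focus on informative regions such as hands and manipulated objects?
RQ3: Can temporal attention fuse features effectively over time?
RQ4: Does knowledge transfer from a large dataset (NTU) benefit small datasets (MSR Daily Activity 3D, SBU Kinect Interaction)?
RQ5: How do joint ordering and pose representation affect recognition performance?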

Key findings

Method	CS	CV	Avg
Lie Group	50.1	52.8	51.5
Skeleton Quads	38.6	41.4	40.0
Dynamic Skeletons	60.2	65.2	62.7
HBRNN	59.1	64.0	61.6
Deep LSTM	60.7	67.3	64.0
Part-aware LSTM	62.9	70.3	66.6
ST-LSTM + TrustG.	69.2	77.7	73.5
STA-LSTM	73.4	81.2	77.2
JTM	76.3	81.1	78.7
DSSCA - SSLM	74.9	-	-
Ours (pose only)	90.5	-	-
Ours (RGB only)	72.0	-	-
Ours (pose + RGB)	94.1	-	-

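Both the pose-only model and the pose + RGB model achieve state-of-the-art results on NTU RGB+D.
The full model achieves state-of-the-art results on the SBU Kinect Interaction dataset.
Performance on MSR Daily Activity 3D is competitive, which reflects how difficult very small datasets are.
A joint ordering that keeps topologically adjacent joints adjacent in the sequence improves NTU results by more than 1 point over a random ordering.
Pose-conditioned spatial attention markedly improves RGB-only performance, with larger gains in the RGB-only setting than in the multi-modal setting (roughly 1 to 12 points).
Knowledge transfer from NTU to MSR and SBU yields meaningful improvements on the small datasets.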
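
The joint-ordering finding can be illustrated with a small sketch. It assumes a depth-first traversal of a toy skeleton as the topology-preserving order; the skeleton, the joint names, and the helper functions are hypothetical and are not taken from the paper, whose exact ordering is not given in this review.

```python
import numpy as np

# Hypothetical toy skeleton: each joint maps to the list of its child joints.
SKELETON = {
    "spine": ["neck", "l_hip", "r_hip"],
    "neck": ["head", "l_shoulder", "r_shoulder"],
    "head": [],
    "l_shoulder": ["l_hand"], "l_hand": [],
    "r_shoulder": ["r_hand"], "r_hand": [],
    "l_hip": ["l_foot"], "l_foot": [],
    "r_hip": ["r_foot"], "r_foot": [],
}

def topological_order(skeleton, root):
    """Depth-first traversal, so each limb occupies a contiguous run of indices."""
    order, stack = [], [root]
    while stack:
        joint = stack.pop()
        order.append(joint)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(skeleton[joint]))
    return order

def pose_tensor(frames, names, order):
    """Reorder the joint axis of a (T, J, 3) sub-sequence.

    frames: (T, J, 3) joint coordinates, joints stored in the order of `names`
    order:  target joint order (e.g. from topological_order)
    """
    idx = [names.index(j) for j in order]
    return frames[:, idx, :]

# Toy sub-sequence: 20 frames, joints stored alphabetically, random 3D coordinates.
names = sorted(SKELETON)
rng = np.random.default_rng(0)
frames = rng.normal(size=(20, len(names), 3))
order = topological_order(SKELETON, "spine")
tensor = pose_tensor(frames, names, order)
```

With this ordering, a convolution kernel sliding along the joint axis sees joints of the same limb next to each other (for example `l_shoulder` followed by `l_hand`), which is the property a random ordering lacks.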

This review was generated by AI and checked by a human editor.