QUICK REVIEW

[Paper Review] HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation

Xiaoye Qian, Youbao Tang|arXiv (Cornell University)|Jan 18, 2023

Human Pose and Action Recognition | Cited by 14

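One-Sentence Summary

HSTFormer introduces hierarchical, bottom-up spatial-temporal transformer encoders (STE, JTTE, BTTE, PTTE) and a fusion module to capture multi-level joint correlations for 3D HPE, achieving state-of-the-art results on MPI-INF-3DHP and strong performance on the other datasets.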

ABSTRACT

Transformer-based approaches have been successfully proposed for 3D human pose estimation (HPE) from 2D pose sequence and achieved state-of-the-art (SOTA) performance. However, current SOTAs have difficulties in modeling spatial-temporal correlations of joints at different levels simultaneously. This is due to the poses' spatial-temporal complexity. Poses move at various speeds temporarily with various joints and body-parts movement spatially. Hence, a cookie-cutter transformer is non-adaptable and can hardly meet the "in-the-wild" requirement. To mitigate this issue, we propose Hierarchical Spatial-Temporal transFormers (HSTFormer) to capture multi-level joints' spatial-temporal correlations from local to global gradually for accurate 3D HPE. HSTFormer consists of four transformer encoders (TEs) and a fusion module. To the best of our knowledge, HSTFormer is the first to study hierarchical TEs with multi-level fusion. Extensive experiments on three datasets (i.e., Human3.6M, MPI-INF-3DHP, and HumanEva) demonstrate that HSTFormer achieves competitive and consistent performance on benchmarks with various scales and difficulties. Specifically, it surpasses recent SOTAs on the challenging MPI-INF-3DHP dataset and small-scale HumanEva dataset, with a highly generalized systematic approach. The code is available at: https://github.com/qianxiaoye825/HSTFormer.

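Motivation and Objectives

Achieve robust in-the-wild 3D HPE by modeling the multi-level spatial-temporal correlations of joints.
Propose a hierarchical transformer framework that propagates information from local joints to the global pose.
Introduce a body-part temporal transformer that captures temporal dynamics across grouped joints.
Integrate multi-level temporal features with a fusion module to improve 3D pose regression.
Demonstrate generalization across multiple datasets, including challenging outdoor scenes.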

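Proposed Method

Embed the 2D pose sequence into high-dimensional features.
Use four transformer encoders: Spatial Transformer Encoder (STE), Joint Temporal Transformer Encoder (JTTE), Body-Part Temporal Transformer Encoder (BTTE), and Pose Temporal Transformer Encoder (PTTE).
Apply a hierarchical bottom-up structure to learn Spatial Correlation (SC), Joint Temporal Correlation (JTC), Body-part Temporal Correlation (BTC), and Pose Temporal Correlation (PTC).
Adaptively integrate the outputs of all encoders using a fusion module with learnable weights.
Predict the 3D pose from the fused features with a regression head, trained with an MPJPE loss.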
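The last two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `mpjpe` follows the standard definition of mean per-joint position error (average Euclidean distance between predicted and ground-truth joints), while `fuse` is a hypothetical stand-in for the fusion module that assumes the learnable weights are softmax-normalized, which the review does not specify. All function and parameter names are assumptions made for this example.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: lists of frames; each frame is a list of (x, y, z)
    joint positions in the same unit (e.g. mm). Returns the average
    Euclidean distance over all joints and frames.
    """
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    return total / count


def fuse(features, logits):
    """Hypothetical fusion: weighted sum of per-encoder feature vectors.

    features: one feature vector (list of floats) per encoder.
    logits: one learnable scalar per encoder; softmax normalization
    is an assumption of this sketch, not a detail from the paper.
    """
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


# One frame with two joints: errors are 0 and 5, so the mean is 2.5.
pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
print(mpjpe(pred, target))  # 2.5
```

With equal logits, `fuse` reduces to a plain average of the encoder outputs; during training the logits would be updated so that more informative levels receive larger weights.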

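Experimental Results

Research Questions

RQ1: Can a hierarchical spatial-temporal transformer framework effectively model multi-level joint correlations for 3D HPE?
RQ2: Does bottom-up information propagation, from joints to body parts to the full pose, improve 3D lifting accuracy?
RQ3: How does adaptive fusion of multi-level features affect MPJPE across datasets?
RQ4: Is body-part-level temporal modeling of grouped joints across frames beneficial?
RQ5: How well does the method generalize to real-world data and small-scale datasets compared with SOTA?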

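Key Findings

HSTFormer achieves competitive results on Human3.6M and MPI-INF-3DHP, and on MPI-INF-3DHP in particular it surpasses SOTA in multiple settings.
The Body-Part Temporal Transformer Encoder (BTTE) substantially reduces MPJPE; one reported result attributes an 11.8% improvement to it.
Hierarchically stacking and fusing STE, JTTE, BTTE, and PTTE lowers MPJPE step by step (from 41.6 mm to 30.6 mm in the ablation).
The model generalizes well, showing larger gains on challenging outdoor/deep scenes and on small-scale datasets such as HumanEva.
HSTFormer reduces MPJPE on MPI-INF-3DHP by up to 24.6% compared with MixSTE (setting T=81).
The ablation study confirms that local-to-global aggregation outperforms the global-to-local order.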
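As a quick sanity check on the ablation figures quoted above, the relative MPJPE reduction implied by going from 41.6 mm to 30.6 mm can be computed directly. The helper name below is illustrative only, and the resulting percentage is derived from the two quoted numbers rather than reported in the review.

```python
def relative_reduction(baseline_mm, improved_mm):
    """Percentage by which improved_mm is lower than baseline_mm."""
    return (baseline_mm - improved_mm) / baseline_mm * 100.0


# Ablation figures quoted in the findings: 41.6 mm down to 30.6 mm.
print(round(relative_reduction(41.6, 30.6), 1))  # 26.4
```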

This review was generated by AI and checked by a human editor.