QUICK REVIEW

[Paper Review] TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation

Haoyu Ma, Liangjian Chen|arXiv (Cornell University)|Oct 18, 2021

Human Pose and Action Recognition | References: 66 | Citations: 36

One-line summary

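This paper introduces TransFusion, a lightweight transformer-based framework that fuses cues from multiple views. It encodes cross-view correspondences with an epipolar field and a geometry positional encoding, and uses them to refine 2D poses and estimate 3D poses.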

ABSTRACT

Estimating the 2D human poses in each view is typically the first step in calibrated multi-view 3D pose estimation. But the performance of 2D pose detectors suffers from challenging situations such as occlusions and oblique viewing angles. To address these challenges, previous works derive point-to-point correspondences between different views from epipolar geometry and utilize the correspondences to merge prediction heatmaps or feature representations. Instead of post-prediction merge/calibration, here we introduce a transformer framework for multi-view 3D pose estimation, aiming at directly improving individual 2D predictors by integrating information from different views. Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion, to fuse cues from both current views and neighboring views. Moreover, we propose the concept of epipolar field to encode 3D positional information into the transformer model. The 3D position encoding guided by the epipolar field provides an efficient way of encoding correspondences between pixels of different views. Experiments on Human 3.6M and Ski-Pose show that our method is more efficient and has consistent improvements compared to other fusion methods. Specifically, we achieve 25.8 mm MPJPE on Human 3.6M with only 5M parameters on 256 x 256 resolution.

Motivation and objectives

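Improve multi-view 3D human pose estimation by strengthening the 2D detectors directly through cross-view fusion.
Use a transformer to integrate information from the current view and neighboring views, going beyond the epipolar-line constraint.
Introduce the epipolar field and Geometry Positional Encoding to encode cross-view 3D correspondences.
Demonstrate better efficiency and accuracy than previous fusion methods on standard multi-view datasets.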

Proposed method

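A CNN backbone extracts low-level features from each of the two views.
A shared transformer encoder fuses the features of both views, using a 2D sinusoidal positional encoding and a Geometry Positional Encoding (GPE).
The epipolar field models cross-view pixel correspondences beyond the epipolar line and guides cross-view attention; it is enforced by the learned pose loss L_pos.
GPE encodes 3D direction information using unit rays from the camera center, enabling attention that accounts for 3D correspondences.
A prediction head outputs 2D joint heatmaps for each view; training optimizes a heatmap MSE loss plus a cross-view correspondence loss.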
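
The geometric ingredients named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, rotation `R`, and center `C` such that `x_cam = R @ (X_world - C)`. `pixel_rays` computes the unit rays from the camera center that a GPE-style encoding could consume. `soft_epipolar_score` and its sharpness parameter `gamma` are hypothetical names for one plausible way to score correspondences softly instead of restricting them to the epipolar line; the review does not give the paper's exact definition of the epipolar field.

```python
import numpy as np

def pixel_rays(K, R, pixels):
    """Unit viewing rays in world coordinates for an (N, 2) array of pixels.

    Assumes x_cam = R @ (X_world - C), so pixel (u, v) back-projects to the
    world direction R.T @ inv(K) @ [u, v, 1].
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])             # (N, 3) homogeneous pixels
    dirs = (R.T @ np.linalg.inv(K) @ homog.T).T   # (N, 3) world directions
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

def soft_epipolar_score(rays_a, rays_b, center_a, center_b, gamma=10.0):
    """Hypothetical soft correspondence score between pixels of two views.

    Each ray, together with the baseline joining the camera centers, spans
    an epipolar plane. Pixels whose planes coincide lie on each other's
    epipolar lines and score 1; the score decays as the planes diverge.
    Rays parallel to the baseline (the epipoles) are not handled here.
    """
    baseline = (center_b - center_a) / np.linalg.norm(center_b - center_a)
    n_a = np.cross(rays_a, baseline)              # (Na, 3) plane normals
    n_b = np.cross(rays_b, baseline)              # (Nb, 3) plane normals
    n_a /= np.linalg.norm(n_a, axis=1, keepdims=True)
    n_b /= np.linalg.norm(n_b, axis=1, keepdims=True)
    cos = np.clip(n_a @ n_b.T, -1.0, 1.0)         # cosine of the plane angle
    return np.abs(cos) ** gamma                   # (Na, Nb) scores in [0, 1]
```

Under these assumptions, the (Na, Nb) score matrix has the same shape as a cross-view attention map, so it could serve as a target or bias for attention between the two views. A larger `gamma` concentrates the score closer to the epipolar line.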

Experimental results

Research questions

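RQ1: Can a unified transformer-based architecture improve 2D pose detection by leveraging information from multiple views?
RQ2: Does incorporating a geometry-aware positional encoding and epipolar-field guidance strengthen cross-view attention for 3D pose estimation?
RQ3: How does TransFusion compare with previous fusion methods on standard datasets in terms of accuracy and parameter efficiency?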

Key findings

Method	Params	MACs	Inference Time (s)	JDR (%) ↑	MPJPE (mm) ↓
Single view - Simple Baseline	34M	51.7G	-	98.5	30.2
Single view - TransPose	5M	43.6G	-	98.6	30.5
Crossview Fusion	235M	55.1G	0.048	99.4	27.8
Epipolar Transformer	34M	51.7G	0.086	98.6	27.1
TransFusion	5M	50.2G	0.032	99.4	25.8
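
The two accuracy columns in the table can be computed along the following lines. This is a minimal sketch, not the paper's evaluation script: MPJPE is taken as the mean Euclidean distance between predicted and ground-truth 3D joints, and JDR as the percentage of joints whose 2D error falls below a distance threshold. The review does not state the JDR threshold, so `threshold` is left as a caller-supplied parameter.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred and gt have shape (..., J, 3); the result is in the input units (e.g. mm).
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def jdr(pred_2d, gt_2d, threshold):
    """Joint detection rate: percentage of joints with 2D error below threshold."""
    dist = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return float(100.0 * np.mean(dist < threshold))

# Toy check: every joint is off by the vector (3, 4, 0), i.e. by 5 units.
gt = np.zeros((17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
print(mpjpe(pred, gt))  # 5.0
```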

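On Human3.6M and Ski-Pose, TransFusion matches or outperforms previous fusion methods on both 2D and 3D pose metrics.
It achieves 25.8 mm MPJPE on Human3.6M with only 5M parameters, compared with 34M for Epipolar Transformer and 235M for Crossview Fusion.
Fusing information from the entire reference view, rather than restricting it to the epipolar line, improves performance on heavily occluded sequences.
Ablation studies show that the 3D geometry positional encoding and the epipolar-field guidance are both important for cross-view correspondence and 3D accuracy.
On Ski-Pose, TransFusion outperforms the single-view baseline while remaining lightweight compared with large fusion models.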

This review was generated by AI and checked by a human editor.