QUICK REVIEW

[Paper Review] AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Md Mushfiqur Azam, John Quarles|arXiv (Cornell University)|Mar 26, 2026

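Human Pose and Action Recognition | Citations: 0

One-Sentence Summary

AG-EgoPose is a dual-stream egocentric 3D pose estimator that fuses 2D joint heatmaps with action-guided motion features through a transformer decoder with learnable joint tokens, and it reports state-of-the-art results on the EgoPW and SceneEgo datasets.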

ABSTRACT

Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.

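Motivation and Objectives

Enable robust 3D pose estimation from egocentric fisheye video under severe perspective distortion and occlusion.
Exploit short- and long-range motion context as action information to resolve pose ambiguity.
Develop a joint-level fusion mechanism that integrates temporal dynamics while preserving spatial precision.
Propose a transformer decoder with learnable joint tokens for anatomically consistent 3D pose regression.
Demonstrate state-of-the-art performance on the EgoPW and SceneEgo benchmarks.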

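Proposed Method

A dual-stream design: a spatial heatmap stream and a motion-based temporal stream.
The spatial stream uses a weight-sharing ResNet-18 encoder-decoder to produce 2D joint heatmaps and per-joint embeddings.
The motion stream uses a ResNet-50 backbone and an ActionFormer-based temporal encoder to capture short- and long-range motion dynamics.
Heatmaps are embedded into per-joint tokens and fused with the motion features into a joint-level memory.
A transformer decoder with learnable joint tokens attends to this memory to regress 3D joint positions.
The loss combines joint position error with bone-length and bone-orientation regularization to enforce anatomical plausibility.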
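
The review does not give the exact form of the loss, so the following is only a minimal sketch of how a joint position term can be combined with bone-length and bone-orientation penalties. The bone list `BONES`, the weights `w_len` and `w_dir`, and the specific penalty forms (absolute length difference, one minus cosine similarity) are illustrative assumptions, not values or formulas taken from the paper.

```python
import math

# Hypothetical 4-joint chain; each (parent, child) pair defines one bone.
BONES = [(0, 1), (1, 2), (2, 3)]


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def pose_loss(pred, target, bones=BONES, w_len=0.1, w_dir=0.1, eps=1e-8):
    """Joint position error plus bone-length and bone-orientation penalties.

    pred, target: lists of [x, y, z] joint positions.
    w_len, w_dir: illustrative weights (assumed, not from the paper).
    """
    # Position term: mean Euclidean distance between matching joints.
    pos = sum(_norm(_sub(p, t)) for p, t in zip(pred, target)) / len(pred)

    len_terms, dir_terms = [], []
    for parent, child in bones:
        bone_pred = _sub(pred[child], pred[parent])
        bone_true = _sub(target[child], target[parent])
        len_pred, len_true = _norm(bone_pred), _norm(bone_true)
        # Length term: absolute difference between bone lengths.
        len_terms.append(abs(len_pred - len_true))
        # Orientation term: 1 - cosine similarity of the bone vectors.
        dot = sum(a * b for a, b in zip(bone_pred, bone_true))
        dir_terms.append(1.0 - dot / (len_pred * len_true + eps))

    bone_len = sum(len_terms) / len(bones)
    bone_dir = sum(dir_terms) / len(bones)
    return pos + w_len * bone_len + w_dir * bone_dir


if __name__ == "__main__":
    target = [[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0]]
    pred = [[0, 0, 0], [0, 1, 0], [0, 2, 0], [1, 2, 0]]  # last bone rotated
    print(pose_loss(pred, target))
```

In this toy case the rotated bone keeps its length, so only the position and orientation terms contribute, which shows why a length penalty alone would not catch the error.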

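Experimental Results

Research Questions

RQ1: How can short- and long-range temporal context improve egocentric 3D pose estimation under occlusion and perspective distortion?
RQ2: Can action-guided motion features be fused effectively with spatial heatmap evidence to improve 3D pose accuracy?
RQ3: Does a joint-token-based transformer decoder enable robust, joint-specific fusion of spatial and temporal cues?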

Key Findings

Method	MPJPE (mm)	PA-MPJPE (mm)	Notes
Mo2Cap2	200.3	121.2	From the SceneEgo comparison table, not the primary EgoPW PA-MPJPE table
xR-EgoPose	241.3	133.9	SceneEgo results (Table 2)
EgoPW	189.6	105.3	EgoPW baseline method
SceneEgo	118.5	92.7	Previous state of the art on SceneEgo
Ours	104.0	76.2	AG-EgoPose on SceneEgo
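
For readers unfamiliar with the metric in the table, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The sketch below is a generic implementation of that definition, not the paper's evaluation script; the function name `mpjpe` and the toy coordinates are made up for illustration. PA-MPJPE additionally applies a Procrustes (similarity) alignment to the prediction before measuring the error, which is not implemented here.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, target: equal-length lists of [x, y, z] joint positions,
    in the same unit (millimetres in the table above).
    """
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    target = [[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]]
    pred = [[0.0, 0.0, 10.0], [100.0, 0.0, 30.0]]
    print(mpjpe(pred, target))  # per-joint errors of 10 and 30 average to 20.0
```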

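On EgoPW, PA-MPJPE improves from 84.2 mm to 76.7 mm, the best result in the table.
On SceneEgo, MPJPE and PA-MPJPE are 104.0 mm and 76.2 mm respectively, outperforming prior methods.
Cross-dataset transfer to SceneEgo yields 104.0/76.2 mm (MPJPE/PA-MPJPE) against the previous best of 118.5/92.7 mm, indicating strong generalization.
In ablations, combining spatial heatmaps with motion features beats either stream alone (EgoPW PA-MPJPE: 90.8 mm to 76.7 mm; SceneEgo: 113.2/80.8 mm to 104.0/76.2 mm).
Cross-attention fusion matters: removing it degrades performance (e.g., 83.1 mm PA-MPJPE on EgoPW without cross-attention).
Pretraining the heatmap stream with BCEWithLogitsLoss rather than MSE improves downstream 3D pose accuracy.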
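
To make the cross-attention fusion in the ablation concrete, here is a minimal single-head sketch in which joint-token queries attend over a memory of fused tokens. It is an assumption-laden illustration rather than the authors' decoder: there are no learned query/key/value projections, no multi-head split, no feed-forward or normalization layers, and no 3D regression head, and the sizes `J`, `M`, `D` and the random vectors are placeholders for learnable joint tokens and the spatial/motion memory.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: J joint-token vectors of size d.
    memory:  M memory vectors of size d.
    Returns (outputs, weights): J fused vectors and J rows of M attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this joint token to every memory entry.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        w = softmax(scores)
        # Weighted sum of memory vectors, one output channel at a time.
        out = [sum(w_i * m[c] for w_i, m in zip(w, memory)) for c in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


if __name__ == "__main__":
    random.seed(0)
    J, M, D = 3, 5, 4  # hypothetical: joints, memory tokens, feature size
    joint_tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(J)]
    memory = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]
    fused, attn = cross_attention(joint_tokens, memory)
    print(len(fused), len(fused[0]))  # one fused vector per joint token
```

Each joint token receives its own attention distribution over the memory, which is what allows evidence to be weighted differently for each joint.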

This review was generated by AI and checked by a human editor.