QUICK REVIEW

[Paper Review] Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation

Jiahao Lin, Gim Hee Lee|arXiv (Cornell University)|Aug 22, 2019

Human Pose and Action Recognition | Cited by 57

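One-Sentence Summary

This paper introduces a trajectory space factorization framework that represents a 3D pose sequence as a motion matrix factorized into fixed trajectory bases and regressed trajectory coefficients, which allows the 3D poses of all input frames to be estimated at once and achieves state-of-the-art results.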

ABSTRACT

Existing deep learning approaches on 3d human pose estimation for videos are either based on Recurrent or Convolutional Neural Networks (RNNs or CNNs). However, RNN-based frameworks can only tackle sequences with limited frames because sequential models are sensitive to bad frames and tend to drift over long sequences. Although existing CNN-based temporal frameworks attempt to address the sensitivity and drift problems by concurrently processing all input frames in the sequence, the existing state-of-the-art CNN-based framework is limited to 3d pose estimation of a single frame from a sequential input. In this paper, we propose a deep learning-based framework that utilizes matrix factorization for sequential 3d human poses estimation. Our approach processes all input frames concurrently to avoid the sensitivity and drift problems, and yet outputs the 3d pose estimates for every frame in the input sequence. More specifically, the 3d poses in all frames are represented as a motion matrix factorized into a trajectory bases matrix and a trajectory coefficient matrix. The trajectory bases matrix is precomputed from matrix factorization approaches such as Singular Value Decomposition (SVD) or Discrete Cosine Transform (DCT), and the problem of sequential 3d pose estimation is reduced to training a deep network to regress the trajectory coefficient matrix. We demonstrate the effectiveness of our framework on long sequences by achieving state-of-the-art performances on multiple benchmark datasets. Our source code is available at: https://github.com/jiahaoLjh/trajectory-pose-3d.

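Motivation and Objectives

Address the weaknesses of existing temporal models: RNN-based models are sensitive to bad frames and drift over long sequences, while the state-of-the-art CNN-based model outputs a 3D pose for only a single frame per input sequence.
Represent a sequence of 3D poses as a motion matrix factorized into a fixed trajectory bases matrix and a trajectory coefficient matrix.
Reduce the output dimensionality by regressing trajectory coefficients instead of per-frame poses.
Demonstrate state-of-the-art performance on benchmark datasets, including on long sequences.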
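
As a rough illustration of the dimensionality reduction, the count below compares the size of the two regression targets for a sequence of F frames with J joints and K trajectory bases. F = 50 is the longest sequence length mentioned in this review; J = 17 and K = 8 are illustrative assumptions, not values taken from the paper.

```latex
\underbrace{F \times 3J}_{\text{per-frame poses}} = 50 \times 51 = 2550
\qquad \text{vs.} \qquad
\underbrace{K \times 3J}_{\text{trajectory coefficients}} = 8 \times 51 = 408
```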

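Proposed Method

Represent the 3D joint sequence as a motion matrix S in trajectory space: S = Θ · A, where Θ is the fixed trajectory bases matrix (F×K) and A is the trajectory coefficient matrix (K×3J). Here F is the number of frames, K the number of bases, and J the number of joints.
Precompute Θ from predefined bases: either SVD bases extracted from motion data or Discrete Cosine Transform (DCT) bases.
Extract per-frame 2D joint features, map the temporal dimension into trajectory space with a DCT-based transform, and regress the trajectory coefficients for the K bases with a densely connected MLP.
Reconstruct the 3D poses of all frames as a linear combination of the trajectory bases weighted by the regressed coefficients; train with an L1 loss over the whole sequence.
At inference, apply a sliding window strategy to long videos and average the multiple estimates obtained for each frame to improve robustness.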
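
The factorization and the sliding window averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `dct_basis` and `sliding_window_average` are hypothetical, the motion matrix is synthetic, and the coefficients are obtained by projecting that matrix onto the bases, which stands in for the network that regresses them in the paper. The values F = 50, K = 8, and J = 17 are illustrative.

```python
import numpy as np


def dct_basis(num_frames, num_bases):
    """First `num_bases` orthonormal DCT-II basis vectors as an (F, K) matrix."""
    t = np.arange(num_frames)[:, None]   # frame indices 0..F-1
    k = np.arange(num_bases)[None, :]    # frequency indices 0..K-1
    theta = np.sqrt(2.0 / num_frames) * np.cos(
        np.pi * (2 * t + 1) * k / (2 * num_frames)
    )
    theta[:, 0] /= np.sqrt(2.0)          # constant (DC) column has norm 1 after this
    return theta


def sliding_window_average(window_preds, total_frames, stride=1):
    """Average overlapping per-window predictions into one sequence.

    window_preds: (W, F, D) array; window w covers frames
    [w * stride, w * stride + F). Returns a (total_frames, D) array.
    """
    num_windows, window_len, dim = window_preds.shape
    total = np.zeros((total_frames, dim))
    count = np.zeros((total_frames, 1))
    for w in range(num_windows):
        start = w * stride
        total[start:start + window_len] += window_preds[w]
        count[start:start + window_len] += 1
    return total / np.maximum(count, 1)  # frames never covered stay at zero


# Toy example with illustrative sizes (assumptions, not the paper's settings).
F, K, J = 50, 8, 17
rng = np.random.default_rng(0)
theta = dct_basis(F, K)                                  # Theta: (F, K)

# Synthetic smooth "motion matrix" S of shape (F, 3J).
t = np.linspace(0.0, 1.0, F)[:, None]
motion = np.sin(2 * np.pi * t * rng.uniform(0.2, 1.0, size=(1, 3 * J)))

# Projection stands in for the regression network: A = Theta^T S,
# which is valid because the columns of Theta are orthonormal.
coeffs = theta.T @ motion                                # A: (K, 3J)
recon = theta @ coeffs                                   # S ~= Theta A: (F, 3J)

print("coefficient matrix shape:", coeffs.shape)
print("mean abs reconstruction error:", np.abs(recon - motion).mean())
```

For a long video, each window of F frames would yield one (F, 3J) reconstruction; stacking those into a (W, F, 3J) array and passing it to `sliding_window_average` gives one averaged pose per frame.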

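Experimental Results

Research Questions

RQ1: Can a fixed trajectory bases representation capture the intrinsic temporal structure of human motion and enable accurate multi-frame 3D pose estimation from 2D input?
RQ2: Does regressing trajectory coefficients in trajectory space offer advantages in training efficiency and temporal consistency over conventional shape space or per-frame approaches?
RQ3: How do the number of frames (F) and the number of bases (K) affect reconstruction accuracy and robustness over long sequences?
RQ4: Can the proposed trajectory space approach compete with state-of-the-art RNN/CNN temporal methods on standard benchmarks (Human3.6M, MPI-INF-3DHP) without regressing a full per-frame output?

Key Findings

Achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP under various protocols, most notably for long input sequences (F up to 50).
Shows that a small number of trajectory bases (K ≪ F) is sufficient to model human motion, which allows a compact regression of the coefficients.
Produces stable 3D pose estimates for every frame of the input sequence rather than only a single center frame, and outperforms many RNN-based temporal models.
Both SVD-derived and DCT bases give competitive results, showing the model's flexibility with respect to the choice of bases.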
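
To give a feel for why a small K can suffice and how SVD-derived and DCT bases compare, the sketch below measures the truncation error of both on synthetic smooth trajectories. Everything here is an assumption made for illustration: the trajectories are random sinusoids rather than motion capture data, the helper names are hypothetical, and the printed numbers say nothing about the paper's reported results.

```python
import numpy as np


def dct_basis(num_frames, num_bases):
    """First `num_bases` orthonormal DCT-II basis vectors as an (F, K) matrix."""
    t = np.arange(num_frames)[:, None]
    k = np.arange(num_bases)[None, :]
    theta = np.sqrt(2.0 / num_frames) * np.cos(
        np.pi * (2 * t + 1) * k / (2 * num_frames)
    )
    theta[:, 0] /= np.sqrt(2.0)
    return theta


def svd_basis(trajectories, num_bases):
    """Top-K right singular vectors of an (N, F) trajectory matrix, as (F, K)."""
    _, _, vt = np.linalg.svd(trajectories, full_matrices=False)
    return vt[:num_bases].T


def rms_truncation_error(theta, trajectories):
    """RMS error after projecting each trajectory onto the span of theta."""
    recon = trajectories @ theta @ theta.T
    return float(np.sqrt(np.mean((recon - trajectories) ** 2)))


# Synthetic smooth trajectories: N sinusoids sampled over F frames.
rng = np.random.default_rng(0)
F, N = 50, 500
t = np.linspace(0.0, 1.0, F)[None, :]
freqs = rng.uniform(0.5, 3.0, size=(N, 1))
phases = rng.uniform(0.0, 2.0 * np.pi, size=(N, 1))
trajectories = np.sin(2.0 * np.pi * freqs * t + phases)   # (N, F)

for K in (2, 4, 8, 16):
    err_svd = rms_truncation_error(svd_basis(trajectories, K), trajectories)
    err_dct = rms_truncation_error(dct_basis(F, K), trajectories)
    print(f"K={K:2d}  SVD error={err_svd:.4f}  DCT error={err_dct:.4f}")
```

On the data used to compute it, the SVD basis gives the smallest possible squared error for a given K, while the DCT basis needs no training data at all, which is one way to read the trade-off between the two choices.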

This review was generated by AI and checked by a human editor.