QUICK REVIEW

[Paper Review] Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation

Jiahao Lin, Gim Hee Lee | arXiv (Cornell University) | 2019-08-22

Human Pose and Action Recognition | Citations: 57

One-Line Summary

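This paper presents a trajectory-space factorization framework that treats a 3D pose sequence as a motion matrix factorized into a fixed trajectory basis and learnable trajectory coefficients. The network estimates the 3D poses of all input frames at once and achieves state-of-the-art results.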

ABSTRACT

Existing deep learning approaches on 3d human pose estimation for videos are either based on Recurrent or Convolutional Neural Networks (RNNs or CNNs). However, RNN-based frameworks can only tackle sequences with limited frames because sequential models are sensitive to bad frames and tend to drift over long sequences. Although existing CNN-based temporal frameworks attempt to address the sensitivity and drift problems by concurrently processing all input frames in the sequence, the existing state-of-the-art CNN-based framework is limited to 3d pose estimation of a single frame from a sequential input. In this paper, we propose a deep learning-based framework that utilizes matrix factorization for sequential 3d human poses estimation. Our approach processes all input frames concurrently to avoid the sensitivity and drift problems, and yet outputs the 3d pose estimates for every frame in the input sequence. More specifically, the 3d poses in all frames are represented as a motion matrix factorized into a trajectory bases matrix and a trajectory coefficient matrix. The trajectory bases matrix is precomputed from matrix factorization approaches such as Singular Value Decomposition (SVD) or Discrete Cosine Transform (DCT), and the problem of sequential 3d pose estimation is reduced to training a deep network to regress the trajectory coefficient matrix. We demonstrate the effectiveness of our framework on long sequences by achieving state-of-the-art performances on multiple benchmark datasets. Our source code is available at: https://github.com/jiahaoLjh/trajectory-pose-3d.

Motivation and Objectives

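Improve video-based 3D pose estimation through trajectory-space factorization, addressing the drift and bad-frame sensitivity of RNN-based models and the single-frame output of existing CNN-based temporal models.
Represent the 3D poses of a sequence as a motion matrix factorized into a fixed trajectory basis matrix and a coefficient matrix.
Reduce the output dimensionality by regressing trajectory coefficients rather than per-frame poses.
Demonstrate state-of-the-art performance on long sequences across benchmark datasets.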
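The dimensionality argument can be written out as a short worked equation. The symbols follow the review's notation (F frames, J joints, K trajectory bases); the counts below are a sketch derived from the matrix shapes, not figures quoted from the paper.

```latex
\[
S = \Theta A, \qquad
S \in \mathbb{R}^{F \times 3J}, \quad
\Theta \in \mathbb{R}^{F \times K}, \quad
A \in \mathbb{R}^{K \times 3J}
\]
\[
\underbrace{3JF}_{\text{values for per-frame regression}}
\;\longrightarrow\;
\underbrace{3JK}_{\text{values for coefficient regression}},
\qquad K \ll F
\]
```

Because the basis matrix is fixed, only the entries of the coefficient matrix have to be predicted, so the regression target shrinks by a factor of F/K.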

Proposed Method

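Represent the 3D joint sequence as a motion matrix S in trajectory space: S = Θ · A, where Θ is the fixed trajectory basis matrix (F×K) and A is the trajectory coefficient matrix (K×3J), so S has shape F×3J.
Precompute Θ from a predefined basis: either SVD-based trajectory bases extracted from motion data or Discrete Cosine Transform (DCT) bases.
Extract per-frame 2D joint features, map the temporal axis into trajectory space with a DCT-style transform, then regress the K trajectory coefficients with a densely connected MLP.
Reconstruct the 3D poses of all frames as a linear combination of the trajectory bases weighted by the regressed coefficients; train with an L1 loss over the sequence.
At inference, apply a sliding-window strategy to longer videos and average the multiple estimates obtained for each frame to improve robustness.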
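The reconstruction and sliding-window steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the orthonormal DCT-II normalization, the stride of 1, and the names `dct_basis`, `reconstruct`, `sliding_window_average`, and `predict_window` are all assumptions introduced here.

```python
import numpy as np


def dct_basis(num_frames: int, num_bases: int) -> np.ndarray:
    """First `num_bases` orthonormal DCT-II basis vectors as an (F, K) matrix."""
    t = np.arange(num_frames)
    k = np.arange(num_bases)
    theta = np.cos(np.pi * (2 * t[:, None] + 1) * k[None, :] / (2 * num_frames))
    theta *= np.sqrt(2.0 / num_frames)
    theta[:, 0] /= np.sqrt(2.0)  # the constant (DC) column needs extra scaling
    return theta


def reconstruct(theta: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """S = theta @ A: (F, K) @ (K, 3J) -> (F, 3J) poses for every frame."""
    return theta @ coeffs


def sliding_window_average(predict_window, inputs_2d: np.ndarray, window: int) -> np.ndarray:
    """Average overlapping per-window estimates for each frame.

    `predict_window` is a hypothetical callable mapping a (window, D_in)
    array of 2D features to a (window, D_out) array of 3D poses.
    A stride of 1 is assumed here.
    """
    num_frames = inputs_2d.shape[0]
    if num_frames < window:
        raise ValueError("video is shorter than the window length")
    total = None
    counts = np.zeros(num_frames)
    for start in range(num_frames - window + 1):
        estimate = predict_window(inputs_2d[start:start + window])
        if total is None:
            total = np.zeros((num_frames, estimate.shape[1]))
        total[start:start + window] += estimate
        counts[start:start + window] += 1
    return total / counts[:, None]
```

In a trained system, `predict_window` would wrap the network that regresses the coefficients followed by `reconstruct`. Frames near the middle of a video receive the most overlapping estimates, while the first and last frames receive only one.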

Experimental Results

Research Questions

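RQ1: Does a fixed trajectory basis representation capture the intrinsic temporal structure of human motion well enough to enable accurate multi-frame 3D pose estimation from 2D input?
RQ2: Does regressing trajectory coefficients in trajectory space improve learning efficiency and temporal consistency compared with conventional shape-space or per-frame approaches?
RQ3: How do the number of frames (F) and the number of bases (K) affect reconstruction accuracy and robustness on long sequences?
RQ4: Is the proposed trajectory-space approach competitive with state-of-the-art RNN/CNN temporal methods on standard benchmarks (Human3.6M, MPI-INF-3DHP) without requiring a large per-frame output?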

Key Results

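Achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP under multiple protocols, with the advantage most pronounced for longer input sequences (F up to 50).
A small number of trajectory bases (K ≪ F) is enough to model human motion, which allows a compact regression of the coefficients.
Outperforms many RNN-based temporal models by producing stable 3D pose estimates for every frame of the input sequence rather than for a single center frame.
Both SVD-based and DCT-based bases give competitive results, suggesting the model is flexible with respect to the choice of basis.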
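The claims about K ≪ F and SVD-derived bases can be explored with a small sketch. It assumes the SVD basis is taken as the leading left singular vectors of a matrix whose columns are training trajectories; that construction, and the names `svd_basis` and `rms_truncation_error`, are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np


def svd_basis(trajectories: np.ndarray, num_bases: int) -> np.ndarray:
    """Leading left singular vectors of an (F, N) trajectory matrix as an (F, K) basis.

    Each column of `trajectories` is one coordinate of one joint over F frames.
    """
    if num_bases > min(trajectories.shape):
        raise ValueError("num_bases exceeds the rank available from the data")
    u, _, _ = np.linalg.svd(trajectories, full_matrices=False)
    return u[:, :num_bases]


def rms_truncation_error(motion: np.ndarray, theta: np.ndarray) -> float:
    """RMS error after projecting an (F, 3J) motion matrix onto the K bases."""
    coeffs = theta.T @ motion  # (K, 3J) least-squares coefficients for orthonormal theta
    residual = motion - theta @ coeffs
    return float(np.sqrt(np.mean(residual ** 2)))
```

Evaluating `rms_truncation_error` for increasing K on held-out motion shows how quickly the error falls. For smooth trajectories most of the energy sits in the first few bases, which is the property the coefficient regression relies on.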

This review was generated by AI and reviewed by a human editor.