QUICK REVIEW

[Paper Review] Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation

Jiahao Lin, Gim Hee Lee | arXiv (Cornell University) | Aug 22, 2019

Human Pose and Action Recognition | Cited by 57

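One-Sentence Summary

The paper proposes a trajectory space factorization framework that treats a 3D pose sequence as a motion matrix and factorizes it into fixed trajectory bases and learnable trajectory coefficients, which enables 3D pose estimation for every frame of the input and achieves state-of-the-art results.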

ABSTRACT

Existing deep learning approaches on 3d human pose estimation for videos are either based on Recurrent or Convolutional Neural Networks (RNNs or CNNs). However, RNN-based frameworks can only tackle sequences with limited frames because sequential models are sensitive to bad frames and tend to drift over long sequences. Although existing CNN-based temporal frameworks attempt to address the sensitivity and drift problems by concurrently processing all input frames in the sequence, the existing state-of-the-art CNN-based framework is limited to 3d pose estimation of a single frame from a sequential input. In this paper, we propose a deep learning-based framework that utilizes matrix factorization for sequential 3d human poses estimation. Our approach processes all input frames concurrently to avoid the sensitivity and drift problems, and yet outputs the 3d pose estimates for every frame in the input sequence. More specifically, the 3d poses in all frames are represented as a motion matrix factorized into a trajectory bases matrix and a trajectory coefficient matrix. The trajectory bases matrix is precomputed from matrix factorization approaches such as Singular Value Decomposition (SVD) or Discrete Cosine Transform (DCT), and the problem of sequential 3d pose estimation is reduced to training a deep network to regress the trajectory coefficient matrix. We demonstrate the effectiveness of our framework on long sequences by achieving state-of-the-art performances on multiple benchmark datasets. Our source code is available at: https://github.com/jiahaoLjh/trajectory-pose-3d.

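Motivation and Objectives

Use trajectory space factorization for video-based 3D pose estimation, addressing the bad-frame sensitivity and drift of RNN-based models and the single-frame output of existing CNN-based temporal models.
Represent the 3D pose sequence as a motion matrix factorized into a fixed trajectory bases matrix and a coefficient matrix.
Reduce the output dimension by regressing trajectory coefficients instead of per-frame poses.
Demonstrate state-of-the-art performance on benchmark datasets with long sequences.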

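Proposed Method

Represent the 3D joint sequence as a motion matrix S in trajectory space: S = Θ · A, where Θ is the fixed trajectory bases matrix (F×K) and A is the trajectory coefficient matrix (K×3J).
Precompute Θ from predefined bases: either SVD-based trajectory bases extracted from motion data, or Discrete Cosine Transform (DCT) bases.
Extract per-frame 2D joint features, convert the temporal dimension to trajectory space with a DCT-like transform, and regress the K×3J trajectory coefficient matrix A with a densely connected MLP.
Reconstruct the 3D poses of all frames as a linear combination of the trajectory bases weighted by the regressed coefficients; train with an L1 loss over the sequence.
At inference time, apply a sliding-window strategy to longer videos and average the multiple estimates obtained for each frame to improve robustness.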
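
The reconstruction and sliding-window steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (dct_basis, reconstruct, sliding_window_average) are hypothetical, the orthonormal DCT-II normalization is one common choice, K = 8 and J = 17 are illustrative values, and a random matrix stands in for the coefficient matrix A that the network would regress.

```python
import numpy as np

def dct_basis(F, K):
    """Return an F x K matrix whose columns are the first K orthonormal DCT-II vectors."""
    t = np.arange(F)[:, None]            # frame index, shape (F, 1)
    k = np.arange(K)[None, :]            # basis index, shape (1, K)
    theta = np.cos(np.pi * (2 * t + 1) * k / (2 * F)) * np.sqrt(2.0 / F)
    theta[:, 0] = np.sqrt(1.0 / F)       # the constant vector uses its own scale
    return theta

def reconstruct(theta, A):
    """S = Theta @ A: (F x K) times (K x 3J) gives the (F x 3J) motion matrix."""
    return theta @ A

def sliding_window_average(windows, starts, num_frames):
    """Average overlapping per-window estimates into one pose per frame."""
    F, D = windows[0].shape
    total = np.zeros((num_frames, D))
    count = np.zeros((num_frames, 1))
    for S_win, s in zip(windows, starts):
        total[s:s + F] += S_win
        count[s:s + F] += 1
    return total / np.maximum(count, 1)  # frames covered by no window stay zero

# Illustrative sizes (assumptions): 50 frames, 8 bases, 17 joints.
F, K, J = 50, 8, 17
theta = dct_basis(F, K)

rng = np.random.default_rng(0)
A = rng.standard_normal((K, 3 * J))      # stand-in for the regressed coefficients
S = reconstruct(theta, A)                # 3D poses for all F frames at once

# Two overlapping windows with constant dummy poses, shifted by 25 frames.
windows = [np.full((F, 3 * J), 1.0), np.full((F, 3 * J), 3.0)]
video = sliding_window_average(windows, starts=[0, 25], num_frames=75)
```

Because Θ is fixed, the regression target shrinks from F×3J values to K×3J values; with the illustrative sizes above that is 2550 versus 408.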

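Experimental Results

Research Questions

RQ1: Can a fixed trajectory bases representation capture the essential temporal structure of human motion, enabling accurate multi-frame 3D pose estimation from 2D input?
RQ2: Does regressing trajectory coefficients in trajectory space outperform conventional shape-space or per-frame approaches in training efficiency and temporal consistency?
RQ3: How do the number of frames (F) and the number of bases (K) affect reconstruction accuracy and robustness on long sequences?
RQ4: On standard benchmarks (Human3.6M, MPI-INF-3DHP), is the proposed trajectory space method competitive with state-of-the-art RNN/CNN temporal methods without requiring a large per-frame output?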

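Key Findings

Achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP under different protocols, particularly with longer input sequences (F up to 50).
Shows that a small number of trajectory bases (K ≪ F) is sufficient to model human motion, which allows a compact regression of the coefficients.
Outperforms many RNN-based temporal models by producing stable 3D pose estimates for all frames of the input sequence rather than only a center frame.
Shows that both SVD-derived and DCT bases yield competitive results, indicating that the framework is flexible with respect to the choice of bases.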
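
The claim that few bases suffice can be illustrated on toy data. The sketch below is an assumption-laden example, not an experiment from the paper: it generates synthetic smooth trajectories (sums of low-frequency sinusoids, a hypothetical stand-in for real joint trajectories), derives trajectory bases from them with SVD, and measures the relative reconstruction error on held-out trajectories when only the first K bases are kept. The helper names are made up for illustration.

```python
import numpy as np

def smooth_trajectories(n, F, rng):
    """Return an (F x n) matrix of synthetic smooth 1D trajectories."""
    t = np.linspace(0.0, 1.0, F)[:, None, None]            # (F, 1, 1)
    freq = rng.uniform(0.5, 3.0, size=(1, n, 3))           # low frequencies only
    phase = rng.uniform(0.0, 2 * np.pi, size=(1, n, 3))
    amp = rng.uniform(0.5, 1.0, size=(1, n, 3))
    return (amp * np.sin(2 * np.pi * freq * t + phase)).sum(axis=2)

def truncation_error(U, data, K):
    """Relative error after projecting data onto the first K columns of U."""
    theta = U[:, :K]                                       # (F, K) trajectory bases
    recon = theta @ (theta.T @ data)                       # project, then rebuild
    return np.linalg.norm(data - recon) / np.linalg.norm(data)

rng = np.random.default_rng(0)
F = 50
train = smooth_trajectories(500, F, rng)                   # used to derive the bases
held_out = smooth_trajectories(100, F, rng)                # used to measure the error

# Left singular vectors of the training matrix serve as SVD-derived bases.
U, _, _ = np.linalg.svd(train, full_matrices=False)        # U has shape (F, F)

errors = {K: truncation_error(U, held_out, K) for K in (2, 8, 50)}
for K, err in errors.items():
    print(f"K={K:2d}  relative error={err:.4f}")
```

Since the bases for a smaller K are a subset of those for a larger K, the error can only shrink as K grows, and it vanishes up to rounding at K = F. How quickly it shrinks on real motion data is an empirical question that the paper's experiments address.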


This review was generated by AI and reviewed by a human editor.