QUICK REVIEW

[Paper Review] Flowing ConvNets for Human Pose Estimation in Videos

Tomas Pfister, James Charles|arXiv (Cornell University)|Jun 9, 2015

Human Pose and Action Recognition | References: 37 | Cited by: 89

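One-Sentence Summary

This paper proposes a Flowing ConvNet architecture that uses optical flow to temporally align heatmap predictions across multiple video frames, improving the accuracy of human pose estimation. By combining a deeper feature extraction network, spatial fusion layers that model relationships between body parts, and a learnable pooling layer that weights and merges the warped heatmaps, the method achieves state-of-the-art performance on three video pose estimation datasets, including a 30% improvement in wrist accuracy (at d=8) on the Poses in the Wild dataset.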

ABSTRACT

The objective of this work is human pose estimation in videos, where multiple frames are available. We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow. To this end we propose a network architecture with the following novelties: (i) a deeper network than previously investigated for regressing heatmaps; (ii) spatial fusion layers that learn an implicit spatial model; (iii) optical flow is used to align heatmap predictions from neighbouring frames; and (iv) a final parametric pooling layer which learns to combine the aligned heatmaps into a pooled confidence map. We show that this architecture outperforms a number of others, including one that uses optical flow solely at the input layers, one that regresses joint coordinates directly, and one that predicts heatmaps without spatial fusion. The new architecture outperforms the state of the art by a large margin on three video pose estimation datasets, including the very challenging Poses in the Wild dataset, and outperforms other deep methods that don't use a graphical model on the single-image FLIC benchmark (and also Chen & Yuille and Tompson et al. in the high precision region).

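Motivation and Objectives

Improve human pose estimation in videos by exploiting temporal context across multiple frames.
Address kinematically inconsistent pose predictions by implicitly modelling the spatial relationships between body parts.
Increase the confidence and accuracy of heatmaps by using optical flow to warp predictions from neighbouring frames.
Outperform existing deep learning methods that do not explicitly model temporal consistency or spatial relationships.
Validate the effectiveness of learning temporal fusion weights end-to-end through a parametric pooling layer.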

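Proposed Method

Uses a deeper ConvNet architecture to regress joint heatmaps, extending from initial heatmap prediction to learning an implicit spatial model of the human body layout.
Introduces spatial fusion layers that model dependencies between body parts, reducing kinematically impossible pose configurations.
Uses optical flow to warp heatmap predictions from neighbouring frames to the current frame, aligning the temporal predictions in image space.
A parametric pooling layer learns to combine the warped heatmaps, favouring the most confident predictions across time.
The whole network is trained end-to-end with backpropagation, jointly optimising feature learning, optical-flow-based alignment, and fusion.
A fully convolutional design processes multi-frame video clips, and each joint position is predicted as the location of the peak in the pooled heatmap.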
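
The warp, pool, and peak-picking steps listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the flow convention (each current-frame pixel stores the offset to its match in the neighbouring frame), the nearest-neighbour sampling, and the use of one scalar weight per frame are all assumptions made for illustration. In the paper the pooling weights are learned, whereas here they are passed in as fixed numbers.

```python
import numpy as np

def warp_heatmap(heatmap, flow):
    """Backward-warp a neighbouring frame's heatmap (H, W) into the current frame.

    Assumed convention: flow[y, x] = (dx, dy) points from pixel (x, y) of the
    current frame to its matching location in the neighbouring frame.
    Sampling is nearest-neighbour; out-of-range samples are clamped to the border.
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return heatmap[src_y, src_x]

def pool_heatmaps(warped, weights):
    """Weighted sum over the frame axis: warped is (T, H, W), weights is (T,)."""
    return np.tensordot(weights, warped, axes=1)

def peak_location(heatmap):
    """Return the (x, y) position of the maximum of a 2-D heatmap."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)

# Toy example: the joint is at (x=5, y=4) in the current frame and appears
# 2 px to the right in the neighbouring frame.
h, w = 8, 10
current = np.zeros((h, w))
current[4, 5] = 0.6
neighbour = np.zeros((h, w))
neighbour[4, 7] = 1.0
flow = np.zeros((h, w, 2))
flow[..., 0] = 2.0  # current -> neighbour: +2 px in x

warped = warp_heatmap(neighbour, flow)
pooled = pool_heatmaps(np.stack([current, warped]), np.array([0.5, 0.5]))
print(peak_location(pooled))
```

After warping, the neighbouring frame's peak lands on the same pixel as the current frame's peak, so the pooled map reinforces that location rather than producing two competing modes.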

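Experimental Results

Research Questions

RQ1: Can optical flow be used effectively to align heatmap predictions across video frames and thereby improve pose estimation performance?
RQ2: Does learning a spatial model of the relationships between body parts through additional convolutional layers reduce kinematically inconsistent pose predictions?
RQ3: Can a learnable pooling mechanism that fuses warped heatmaps from multiple frames outperform simple averaging or early fusion of input frames?
RQ4: How does the proposed architecture compare with current state-of-the-art methods on challenging video pose estimation benchmarks?
RQ5: To what extent does combining optical flow with spatial fusion improve performance on datasets with large variation in pose and appearance?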

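Key Findings

On the Poses in the Wild dataset, the Flowing ConvNet improves on the previous state of the art at d=8 by 30% for wrists and 24% for elbows.
With optical flow, accuracy at d=8 improves by 10% for wrists and 13% for elbows, demonstrating the value of temporal alignment.
Even without optical flow, the model outperforms the state of the art on the ChaLearn dataset by 3.5% at d=6, and using a deeper network yields a further 13% gain.
On the FLIC benchmark, accuracy at d=0.05 improves by 20% over methods that do not use a graphical model, and the method matches or slightly exceeds graphical-model methods in the high-precision region.
As the qualitative failure analysis shows, the spatial fusion layers resolve failure cases caused by multi-modal heatmaps by enforcing kinematic consistency.
The proposed architecture achieves state-of-the-art performance on all three main video pose estimation datasets: BBC Pose, ChaLearn, and Poses in the Wild.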
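
The "at d=8" and "at d=6" figures above are accuracies at a distance threshold. As a minimal sketch, assuming that a joint counts as correct when its predicted position lies within d of the ground-truth position (in the same units as the coordinates, with an inclusive threshold), the metric could be computed as below. The function name and the toy coordinates are hypothetical. The d=0.05 value quoted for FLIC suggests a normalised distance, which this sketch does not implement.

```python
import numpy as np

def accuracy_at_d(pred, gt, d):
    """Fraction of joints predicted within distance `d` of ground truth.

    pred, gt: array-likes of shape (N, 2) holding (x, y) positions.
    The inclusive threshold (<=) is an assumption of this sketch.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= d))

# Toy data: the three errors are 5, 9, and 0 pixels.
pred = [[10, 10], [20, 20], [30, 30]]
gt = [[13, 14], [20, 29], [30, 30]]
print(accuracy_at_d(pred, gt, d=8))  # 2 of 3 joints fall within 8 px
```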

This review was generated by AI and reviewed by human editors.