QUICK REVIEW

[Paper Review] Towards Accurate Multi-person Pose Estimation in the Wild

George Papandreou, Tyler Zhu|arXiv (Cornell University)|Jan 6, 2017

Human Pose and Action Recognition|References: 42|Cited by: 76

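One-Sentence Summary

A two-stage top-down system: Faster-RCNN first detects person boxes, then a CNN-based pose estimator predicts heatmaps and offsets for 17 keypoints; combined with OKS-based NMS and pose-based rescoring, it achieves state-of-the-art results on COCO keypoints.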

ABSTRACT

We propose a method for multi-person detection and 2-D pose estimation that achieves state-of-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring. Trained on COCO data alone, our final system achieves average precision of 0.649 on the COCO test-dev set and the 0.643 test-standard sets, outperforming the winner of the 2016 COCO keypoints challenge and other recent state-of-art. Further, by using additional in-house labeled data we obtain an even higher average precision of 0.685 on the test-dev set and 0.673 on the test-standard set, more than 5% absolute improvement compared to the previous best performing method on the same dataset.

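Motivation and Goals

Address multi-person pose estimation in unconstrained "in the wild" images, where person locations are not given.
Develop a robust two-stage pipeline that combines person detection with pose estimation.
Improve the final ranking through keypoint-based confidence scores and OKS-based non-maximum suppression.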

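Proposed Method

Stage 1: detect person bounding boxes with Faster-RCNN on a ResNet-101 backbone with atrous convolution.
Stage 2: crop each proposed box and use a fully convolutional ResNet to predict a heatmap and a 2-D offset field for each keypoint (K = 17 keypoints).
Predict heatmaps h_k(x_i) and offsets F_k(x_i), then aggregate them with a disk-based Hough voting scheme into a highly localized activation map f_k, whose maximum gives the precise keypoint position.
Train with a joint loss over heatmaps and offsets, using a Huber loss for the offsets; add an auxiliary loss at an intermediate layer to stabilize training.
Rescore each pose proposal with a pose-based score: score(I) = (1/K) ∑_k max_{x_i} f_k(x_i).
Apply OKS-based NMS (OKS-NMS) at the pose level to better separate people who are close together.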
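The aggregation and rescoring steps above can be illustrated with a minimal pure-Python sketch. It assumes each pixel casts a vote, weighted by its heatmap value and normalized by the disk area, at the position its offset points to, with the vote spread bilinearly over the four nearest grid cells. The function names (aggregate_votes, argmax_2d, pose_score) and the list-of-lists data layout are hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def aggregate_votes(heatmap, offsets, radius):
    """Fuse one keypoint's heatmap and offset field into an activation map.

    heatmap[y][x]: probability that pixel (x, y) lies within `radius`
                   of the keypoint (assumed layout).
    offsets[y][x]: (dx, dy) vector from pixel (x, y) to the keypoint.
    """
    height, width = len(heatmap), len(heatmap[0])
    activation = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (math.pi * radius ** 2)  # normalize by the disk area
    for y in range(height):
        for x in range(width):
            weight = heatmap[y][x] * norm
            if weight == 0.0:
                continue
            dx, dy = offsets[y][x]
            tx, ty = x + dx, y + dy  # position this pixel votes for
            x0, y0 = math.floor(tx), math.floor(ty)
            ax, ay = tx - x0, ty - y0
            # Spread the vote bilinearly over the four neighbouring cells.
            neighbours = (
                (x0, y0, (1 - ax) * (1 - ay)),
                (x0 + 1, y0, ax * (1 - ay)),
                (x0, y0 + 1, (1 - ax) * ay),
                (x0 + 1, y0 + 1, ax * ay),
            )
            for cx, cy, w in neighbours:
                if 0 <= cx < width and 0 <= cy < height:
                    activation[cy][cx] += weight * w
    return activation


def argmax_2d(grid):
    """Return (x, y, value) of the largest entry in a 2-D grid."""
    value, x, y = max(
        (v, x, y) for y, row in enumerate(grid) for x, v in enumerate(row)
    )
    return x, y, value


def pose_score(activation_maps):
    """Average the per-keypoint maxima: (1/K) * sum_k max_x f_k(x)."""
    peaks = [argmax_2d(fmap)[2] for fmap in activation_maps]
    return sum(peaks) / len(peaks)
```

Because every pixel inside the disk points at the same location, votes pile up there and the activation map peaks far more sharply than the raw heatmap does. Taking argmax_2d of each map gives the keypoint estimate, and pose_score averages the peak values over all K maps.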

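Experimental Results

Research Questions

RQ1: Can a two-stage top-down pipeline (detection + pose estimation) outperform bottom-up methods for multi-person pose estimation in the wild?
RQ2: How do the heatmap + offset representation and Hough-style voting affect keypoint localization accuracy in crowded scenes?
RQ3: Do pose-based rescoring and OKS-based NMS improve COCO keypoint metrics over box-based scoring and IoU NMS?
RQ4: How do the training data (COCO only vs. COCO + in-house data), the backbone network, and the crop size affect COCO keypoint AP?
RQ5: How do different box detectors and pose estimators affect overall performance?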

Key Findings

COCO test-dev
Method	AP	AP .5	AP .75	AP (M)	AP (L)	AR	AR .5	AR .75	AR (M)	AR (L)
CMU-Pose [8]	0.618	0.849	0.675	0.571	0.682	0.665	0.872	0.718	0.606	0.746
Mask-RCNN [21]	0.631	0.873	0.687	0.578	0.714	-	-	-	-	-
G-RMI (ours): COCO-only	0.649	0.855	0.713	0.623	0.700	0.697	0.887	0.755	0.644	0.771
G-RMI (ours): COCO+int	0.685	0.871	0.755	0.658	0.733	0.733	0.901	0.795	0.681	0.804

COCO test-standard
Method	AP	AP .5	AP .75	AP (M)	AP (L)	AR	AR .5	AR .75	AR (M)	AR (L)
CMU-Pose [8]	0.611	0.844	0.667	0.558	0.684	0.665	0.872	0.718	0.602	0.749
G-RMI (ours): COCO-only	0.643	0.846	0.704	0.614	0.696	0.698	0.885	0.755	0.644	0.771
G-RMI (ours): COCO+int	0.673	0.854	0.735	0.642	0.726	0.730	0.898	0.789	0.675	0.805

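On COCO test-dev, training on COCO alone yields an AP of 0.649 (0.643 on test-standard), surpassing the 2016 challenge winner and the Mask-RCNN variant in overall AP.
Adding in-house labeled data raises AP to 0.685 (test-dev) and 0.673 (test-standard).
OKS-NMS and pose-based rescoring substantially improve AP over ranking by box score and standard IoU NMS.
Ablations show that a stronger box detector and higher-resolution pose crops (ResNet-101, 353×257) lead to higher AP (0.685 with COCO+int and 353×257).
A pose estimator trained on COCO+int data improves markedly over one trained on COCO alone (AP 0.673 vs. 0.643 on test-standard).
Single-scale evaluation with one CNN for detection and one CNN for pose estimation already reaches state-of-the-art results; multi-scale evaluation or ensembling may bring further gains.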
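To make the OKS-NMS finding concrete, here is a minimal sketch of greedy pose-level suppression. It measures overlap between two candidate poses with an object keypoint similarity of the COCO form, exp(-d^2 / (2 * area * kappa^2)) averaged over keypoints. The function names, the dict layout of a candidate, the choice to use the kept pose's area as the scale, the assumption that all keypoints are visible, and the threshold value in the usage note are illustrative assumptions, not details confirmed by the paper.

```python
import math


def oks(pose_a, pose_b, area, kappas):
    """Object keypoint similarity between two poses.

    pose_a, pose_b: lists of (x, y), one entry per keypoint.
    area:           scale term (assumed here: area of the reference pose).
    kappas:         per-keypoint falloff constants (supplied by the caller).
    All keypoints are treated as visible in this sketch.
    """
    total = 0.0
    for (xa, ya), (xb, yb), kappa in zip(pose_a, pose_b, kappas):
        dist_sq = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-dist_sq / (2.0 * area * kappa ** 2))
    return total / len(kappas)


def oks_nms(candidates, kappas, threshold):
    """Greedy NMS at the pose level.

    candidates: list of dicts with keys 'pose', 'score', 'area'
                (a hypothetical layout chosen for this example).
    A candidate is kept only if its OKS with every already-kept,
    higher-scoring pose is below `threshold`.
    """
    ordered = sorted(candidates, key=lambda c: c["score"], reverse=True)
    kept = []
    for cand in ordered:
        overlaps = (
            oks(cand["pose"], ref["pose"], ref["area"], kappas) for ref in kept
        )
        if all(value < threshold for value in overlaps):
            kept.append(cand)
    return kept
```

As a usage example, two near-identical poses with scores 0.9 and 0.8 plus a distant pose with score 0.7, passed to oks_nms with an illustrative threshold of 0.5, would keep the 0.9 and 0.7 candidates. Unlike box IoU, the similarity depends on where the joints are, so two people whose boxes overlap heavily but whose poses differ can both survive.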


This review was generated by AI and checked by a human editor.