QUICK REVIEW

[Paper Review] Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

Jie Yang, Ailing Zeng | arXiv (Cornell University) | Feb 3, 2023

Human Pose and Action Recognition | Cited by 16

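One-Sentence Summary

ED-Pose is a fully end-to-end framework that uses explicit human and keypoint box detection to unify global and local pose information. It achieves state-of-the-art results on CrowdPose and strong performance on COCO, with no post-processing.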

ABSTRACT

This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, where it unifies the contextual learning between human-level (global) and keypoint-level (local) information. Different from previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It can provide a good initialization for the latter keypoint detection, making the training process converge fast. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem to learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. In general, ED-Pose is conceptually simple without post-processing and dense heatmap supervision. It demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts the pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with a L1 regression loss, ED-Pose surpasses heatmap-based Top-down methods under the same backbone by 1.2 AP on COCO and achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.

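Motivation and Objectives

Advance end-to-end multi-person pose estimation without post-processing by unifying global (human-level) and local (keypoint-level) cues.
Propose two explicit box detection decoders (human and human-to-keypoint) for coherent global-local learning.
Show that explicit box detection speeds up convergence and improves accuracy on COCO and CrowdPose.
Demonstrate competitive or better performance than one-stage, two-stage, and DETR-based methods across different backbones.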

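Proposed Method

Introduce ED-Pose, which combines a human detection decoder with a human-to-keypoint detection decoder to predict explicit boxes for both people and keypoints.
Represent both humans and keypoints as box predictions (x, y, w, h), optimized with a unified L1-based regression loss and Hungarian set matching.
Use coarse-to-fine query selection to initialize and refine human queries, then expand each human query into keypoint queries to predict keypoint boxes.
Apply interactive learning between human and keypoint detection so that global context propagates to local keypoint predictions.
Train end to end without dense heatmap supervision or post-processing, using a shared regression-based loss across both detection stages.
Compare against top-down, bottom-up, and DETR-based methods on COCO and CrowdPose to show gains in efficiency and accuracy.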
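
The set-matching step in the method list can be illustrated with a small sketch. It is not the authors' implementation. It assumes boxes are normalized (x, y, w, h) tuples and uses the plain L1 distance as the only matching cost. It also swaps the Hungarian algorithm for an exhaustive search over permutations, which finds the same optimal one-to-one assignment but is practical only for a handful of boxes. The names l1_box_distance and match_boxes are hypothetical.

```python
from itertools import permutations


def l1_box_distance(pred, target):
    """Sum of absolute differences between two (x, y, w, h) boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_boxes(preds, targets):
    """Find the minimum-cost one-to-one assignment of predictions to targets.

    Brute-force stand-in for the Hungarian algorithm (assumption: tiny inputs).
    Returns (pairs, cost), where pairs is a list of (pred_idx, target_idx).
    """
    if len(preds) < len(targets):
        raise ValueError("need at least as many predictions as targets")
    best_cost, best_pairs = float("inf"), []
    # Try every ordered choice of distinct predictions, one per target.
    for pred_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(
            l1_box_distance(preds[p], targets[t]) for t, p in enumerate(pred_ids)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_ids)]
    return best_pairs, best_cost


# Three predicted boxes, two ground-truth boxes (made-up values).
preds = [
    (0.50, 0.50, 0.20, 0.40),
    (0.10, 0.20, 0.10, 0.30),
    (0.80, 0.70, 0.15, 0.35),
]
targets = [
    (0.12, 0.22, 0.10, 0.28),
    (0.52, 0.48, 0.22, 0.40),
]
pairs, cost = match_boxes(preds, targets)
print(pairs, round(cost, 4))
```

In this toy input the third prediction is left unmatched. In a DETR-style setup, unmatched predictions are typically supervised as "no object", and the L1 regression loss is computed only over the matched pairs.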

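Experimental Results

Research Questions

RQ1: Can explicit box detection for humans and keypoints yield a fully end-to-end pose estimation framework that needs no post-processing?
RQ2: Do a unified box representation and a consistent L1 regression loss improve the convergence speed and accuracy of multi-person pose estimation?
RQ3: In an end-to-end framework, how do global (human) and local (keypoint) dependencies interact to handle occlusion and crowded scenes?
RQ4: How much does explicit box detection improve performance over existing methods on COCO and CrowdPose?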

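Key Findings

Explicit human box detection markedly improves convergence and accuracy (+4.5 AP on COCO, +9.9 AP on CrowdPose).
With the same backbone, ED-Pose outperforms comparable heatmap-based top-down methods by 1.2 AP on COCO and clearly surpasses PETR.
On CrowdPose, ED-Pose reaches a state-of-the-art 76.6 AP without multi-scale testing or flipping.
Compared with DETR-based methods, ED-Pose converges faster and is more accurate, while remaining fully end-to-end with no post-processing.
With a Swin-L backbone and no bells and whistles, ED-Pose reaches 75.8 AP on COCO val/test-dev and 76.6 AP on CrowdPose.
Ablations confirm the value of explicit human detection, of the keypoint box representation (x, y, w, h) over plain (x, y) coordinates, and of interactive learning between humans and keypoints.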
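
To make the difference between a keypoint box and a plain keypoint point concrete, here is a minimal sketch. It assumes the DETR-style convention in which (x, y) is the box center in normalized image coordinates; this summary does not state the convention, so treat it as an assumption. Under that reading, the keypoint location is the box center, and the width and height mark out a surrounding region from which context features could be gathered. Both function names are hypothetical.

```python
def keypoint_from_box(box):
    """Return the keypoint location, taken here to be the box center (assumption)."""
    x, y, _w, _h = box
    return (x, y)


def keypoint_box_to_corners(box):
    """Convert a center-format (x, y, w, h) box to (x_min, y_min, x_max, y_max).

    The corner form outlines the context region around the keypoint.
    """
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


# A made-up keypoint box, e.g. for a left elbow.
elbow_box = (0.5, 0.4, 0.1, 0.2)
point = keypoint_from_box(elbow_box)
region = keypoint_box_to_corners(elbow_box)
print(point, region)
```

A plain (x, y) representation keeps only the first of these two outputs, so it carries no explicit extent for the area around the keypoint.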

This review was generated by AI and checked by a human editor.