QUICK REVIEW

[Paper Review] DirectPose: Direct End-to-End Multi-Person Pose Estimation

Zhi Tian, Hao Chen | arXiv (Cornell University) | Nov 18, 2019

Human Pose and Action Recognition | References: 29 | Citations: 80

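One-line summary

DirectPose is a fully end-to-end, single-shot framework that predicts instance-aware keypoints directly, without bounding-box detection or post-hoc grouping. It relies on a novel Keypoint Alignment (KPAlign) module and an optional heatmap-based regularizer that is used only during training.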

ABSTRACT

We propose the first direct end-to-end multi-person pose estimation framework, termed DirectPose. Inspired by recent anchor-free object detectors, which directly regress the two corners of target bounding-boxes, the proposed framework directly predicts instance-aware keypoints for all the instances from a raw input image, eliminating the need for heuristic grouping in bottom-up methods or bounding-box detection and RoI operations in top-down ones. We also propose a novel Keypoint Alignment (KPAlign) mechanism, which overcomes the main difficulty: lack of the alignment between the convolutional features and predictions in this end-to-end framework. KPAlign improves the framework's performance by a large margin while still keeping the framework end-to-end trainable. With the only postprocessing non-maximum suppression (NMS), our proposed framework can detect multi-person keypoints with or without bounding-boxes in a single shot. Experiments demonstrate that the end-to-end paradigm can achieve competitive or better performance than previous strong baselines, in both bottom-up and top-down methods. We hope that our end-to-end approach can provide a new perspective for the human pose estimation task.
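The abstract names non-maximum suppression (NMS) as the only post-processing step. As a rough illustration of what that step does, here is a minimal sketch of generic greedy NMS in plain Python. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` with one score each. The function names `iou` and `greedy_nms` and the default threshold of 0.5 are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept detections, highest score first.

    A detection is kept only if it does not overlap any already-kept
    detection by more than iou_threshold (hypothetical default).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, with two heavily overlapping detections and one distant detection, `greedy_nms` keeps the higher-scoring one of the overlapping pair plus the distant one. In a keypoint-only setting, each kept index would carry its own set of predicted keypoints.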

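Motivation and objectives

Motivate a direct end-to-end approach to multi-person pose estimation that bypasses bounding-box detection and keypoint grouping.
Remove heuristic post-processing such as keypoint grouping by introducing an end-to-end trainable pipeline, leaving NMS as the only post-processing step.
Improve keypoint localization accuracy by aligning features with predictions (KPAlign).
Show competitive performance against strong top-down and bottom-up baselines on COCO.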

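Proposed method

Extend the anchor-free FCOS detector with a keypoint detection head that regresses 2K coordinates (for K keypoints) per instance.
Introduce KPAlign, which aligns local features with the predicted keypoints by means of a locator (which produces sampling positions) and a predictor (which regresses each keypoint).
Use differentiable sampling and alignment to enable end-to-end, regression-based keypoint prediction.
Add an optional heatmap-based auxiliary task that regularizes regression learning during training and is removed at test time.
Experiment with grouped KPAlign and separate feature maps to reduce computation and improve performance.
Evaluate with and without bounding-box detection, and compare against state-of-the-art top-down and bottom-up methods on COCO.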
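The regression head and the locator/predictor split can be illustrated with a small, framework-free sketch. It is not the authors' implementation. It assumes that offsets are expressed relative to a feature-map location, that sampling is bilinear on a single-channel map, and that the final keypoint is the sampling position plus the predictor's offset. The function names, the `stride` parameter, and the `predictor` callable are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def decode_keypoints(location: Point, offsets: Sequence[float],
                     stride: float = 1.0) -> List[Point]:
    """Turn 2K regressed values (dx1, dy1, ..., dxK, dyK) into K (x, y) points.

    Assumption: offsets are relative to `location` and scaled by `stride`.
    """
    if len(offsets) % 2 != 0:
        raise ValueError("expected an even number of values (2K)")
    cx, cy = location
    return [(cx + stride * offsets[i], cy + stride * offsets[i + 1])
            for i in range(0, len(offsets), 2)]


def bilinear_sample(feature: Sequence[Sequence[float]], x: float, y: float) -> float:
    """Sample a single-channel map (rows indexed by y) at a fractional position.

    Positions outside the map are clamped to the border.
    """
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom


def aligned_keypoint(feature: Sequence[Sequence[float]], location: Point,
                     locator_offset: Point,
                     predictor: Callable[[float], Point]) -> Point:
    """Locator picks a sampling position; predictor refines from the sampled feature."""
    sx = location[0] + locator_offset[0]
    sy = location[1] + locator_offset[1]
    sampled = bilinear_sample(feature, sx, sy)  # feature aligned with the keypoint
    dx, dy = predictor(sampled)                 # hypothetical learned regressor
    return (sx + dx, sy + dy)
```

The point of the sketch is that the feature used for the final regression is read near the keypoint itself rather than at the instance's original location. Because bilinear interpolation is differentiable with respect to the sampling position, a design of this kind can be trained end to end.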

Experimental results

Research questions

RQ1: Can an end-to-end single-stage framework directly map an input image to instance-aware keypoints without bounding-box detection or RoI-based operations?
RQ2: Does a feature-keypoint alignment (KPAlign) significantly improve end-to-end keypoint regression performance?
RQ3: How does the end-to-end approach compare to traditional top-down and bottom-up methods on COCO in terms of accuracy and speed?
RQ4: What is the impact of auxiliary heatmap learning on the regression-based keypoint predictions during training?
RQ5: Is the method robust when optionally combined with bounding-box detection for shared tasks?

Key findings

Method	AP^kp	AP^kp_50	AP^kp_75	AP^kp_M	AP^kp_L
Ours (R-50)	62.2	86.4	68.2	56.7	69.8
Ours (R-50) †	63.0	86.8	69.3	59.1	69.3
Ours (R-101)	63.3	86.7	69.4	57.8	71.2
Ours (R-101) †	64.8	87.8	71.1	60.4	71.5
† denotes multi-scale testing.

End-to-end DirectPose with KPAlign achieves keypoint AP on COCO that is competitive with strong baselines.
KPAlign provides a large gain over naive end-to-end keypoint regression (over 7 AP points in most ablations).
Grouped KPAlign and separate feature maps further improve accuracy with modest computational trade-offs.
Joint heatmap learning as an auxiliary training task significantly improves regression-based keypoint AP (e.g., from 52.2 to 58.0 AP with 8x heatmaps).
Without bells and whistles, DirectPose reaches 62.2 keypoint AP on COCO test-dev with R-50 and 63.3 with R-101; multi-scale testing raises these to 63.0 and 64.8, respectively.
The method runs at roughly 74-87 ms per image on COCO minival with ResNet backbones, comparable to or faster than Mask R-CNN under similar settings.
When combined with bounding-box detection, the framework achieves 61.5 keypoint AP and 55.3 bounding-box AP on minival, showing that it is compatible with a bounding-box branch.

This review was generated by AI and checked by a human editor.