QUICK REVIEW

[Paper Review] Learning Delicate Local Representations for Multi-Person Pose Estimation

Yuanhao Cai, Zhicheng Wang | arXiv (Cornell University) | Mar 9, 2020

Human Pose and Action Recognition | References: 32 | Citations: 37

One-Line Summary

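This paper proposes the Residual Steps Network (RSN), which efficiently fuses features at the same level to learn delicate local representations, and pairs it with the Pose Refine Machine (PRM), which balances local and global information in the output features. Together they achieve state-of-the-art results on COCO and MPII without extra training data.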

ABSTRACT

In this paper, we propose a novel method called Residual Steps Network (RSN). RSN aggregates features with the same spatial size (Intra-level features) efficiently to obtain delicate local representations, which retain rich low-level spatial information and result in precise keypoint localization. Additionally, we observe the output features contribute differently to final performance. To tackle this problem, we propose an efficient attention mechanism - Pose Refine Machine (PRM) to make a trade-off between local and global representations in output features and further refine the keypoint locations. Our approach won the 1st place of COCO Keypoint Challenge 2019 and achieves state-of-the-art results on both COCO and MPII benchmarks, without using extra training data and pretrained model. Our single model achieves 78.6 on COCO test-dev, 93.0 on MPII test dataset. Ensembled models achieve 79.2 on COCO test-dev, 77.1 on COCO test-challenge dataset. The source code is publicly available for further research at https://github.com/caiyuanhao1998/RSN/

Motivation and Objectives

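Motivation: Preserve the delicate local spatial information within a single feature level to improve keypoint localization accuracy.
Goal: Propose intra-level feature fusion to learn richer local representations for multi-person pose estimation.
Aim: Design an output-feature reweighting mechanism (PRM) that balances local and global information to improve keypoint accuracy.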
Demonstrate that RSN+PRM achieves state-of-the-art results on COCO and MPII without extra data or pretrained models.

Proposed Method

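Propose the Residual Steps Network (RSN), which fuses intra-level features through dense element-wise sums inside each Residual Steps Block (RSB).
Each RSB splits the features into four branches, applies a conv1x1 followed by an incrementally growing number of conv3x3 layers, and merges the branches through dense connections, widening receptive-field coverage (up to 15).
Introduce the Pose Refine Machine (PRM) as an attention module at the output; its multi-path design, with channel and spatial attention paths, rebalances local and global representations.
PRM combines a global pooling path (channel-wise), a depth-wise 9x9 path (spatial), and an identity path to compute f_out = K(f_in) ⊗ (1 + β ⊗ α).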
Train and evaluate RSN+PRM on COCO and MPII, comparing against ResNet, Res2Net, DenseNet baselines and HRNet family.
Show that RSN achieves better performance than baselines at similar GFLOPs and provides efficient, accurate keypoint localization.
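
The two mechanisms above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The branch wiring in `rsb_receptive_fields` is an assumption: branch i stacks i conv3x3 layers, and each conv also receives the same-step output of the previous branch. Under that assumption the largest receptive field comes out at 15, which agrees with the figure quoted above. In `prm_reweight`, treating α as a per-channel weight and β as a single spatial map shared across channels is also an assumption, and the learned convolutions that would produce them are omitted. Both function names are hypothetical.

```python
def rsb_receptive_fields(num_branches=4, kernel=3):
    """Receptive field after every conv3x3 in an RSB-like block.

    Assumed wiring (not confirmed by the review): branch i stacks i convs,
    and the input to conv j of branch i is the element-wise sum of the
    previous feature in the same branch and the output of conv j in
    branch i-1, when that conv exists.
    """
    grow = kernel - 1  # a stride-1 k x k conv widens the field by k - 1
    fields = []        # fields[i][j]: field after conv j of branch i
    for i in range(num_branches):
        row = []
        for j in range(i + 1):
            # The split input (after conv1x1) still sees a single pixel.
            same_branch = row[j - 1] if j > 0 else 1
            prev_branch = fields[i - 1][j] if i > 0 and j < i else 1
            # An element-wise sum sees the larger of its two input fields.
            row.append(max(same_branch, prev_branch) + grow)
        fields.append(row)
    return fields


def prm_reweight(k_feat, alpha, beta):
    """Compute f_out = K(f_in) * (1 + beta * alpha) element-wise.

    k_feat[c][y][x] stands for K(f_in). alpha[c] is a channel weight and
    beta[y][x] a spatial weight; both are assumed to lie in (0, 1) already.
    """
    return [
        [[v * (1.0 + beta[y][x] * alpha[c]) for x, v in enumerate(row)]
         for y, row in enumerate(plane)]
        for c, plane in enumerate(k_feat)
    ]


if __name__ == "__main__":
    print(rsb_receptive_fields())  # [[3], [5, 7], [7, 9, 11], [9, 11, 13, 15]]
    print(prm_reweight([[[2.0]]], [0.5], [[0.5]]))  # [[[2.5]]]
```

Because the attention term is added to 1, the identity path always passes K(f_in) through unchanged, and attention can only amplify a feature by a factor between 1 and 2.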

Experimental Results

Research Questions

RQ1: Can intra-level fusion within the same resolution improve delicate local representations for better keypoint localization?
RQ2: Does an attention-based reweighting (PRM) properly trade off local vs. global features to boost pose estimation accuracy?
RQ3: How does RSN compare to DenseNet, Res2Net, and OSNet in terms of accuracy, efficiency, and localization quality on COCO and MPII?
RQ4: Is PRM beneficial across single-stage and multi-stage architectures, and when replacing standard attention modules such as SE and CBAM?
RQ5: What are the performance and speed trade-offs of RSN relative to HRNet across CPU and GPU in pose estimation tasks?

Key Findings

Backbone	Input size	AP	Δ AP (vs. ResNet baseline)	GFLOPs
ResNet-18	256 × 192	70.7	0	2.3
Res2Net-18	256 × 192	71.3	+0.6	2.2
Baseline1-18	256 × 192	72.9	+2.1	2.5
Baseline2-18	256 × 192	72.1	+1.4	2.5
RSN-18	256 × 192	73.6	+2.9	2.5
ResNet-50	256 × 192	72.2	0	4.6
Res2Net-50	256 × 192	72.8	+0.6	4.5
Baseline1-50	256 × 192	73.7	+1.5	6.4
Baseline2-50	256 × 192	72.7	+0.5	6.4
RSN-50	256 × 192	74.7	+2.5	6.4
ResNet-101	256 × 192	73.2	0	7.5
Res2Net-101	256 × 192	73.9	+0.7	7.5
RSN-101	256 × 192	75.8	+2.5	11.5
4 × ResNet-50	256 × 192	76.8	0	20.6
4 × Res2Net-50	256 × 192	77.0	+0.2	20.1
4 × RSN-50	256 × 192	78.6	+1.8	27.5
4 × RSN-50	384 × 288	79.2	+1.7	61.9
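
As a quick way to read the table, the sketch below recomputes the AP gain and the relative compute cost of RSN against the ResNet row of the same depth. The numbers are copied from the table; the `ROWS` dictionary and the `compare` helper are illustrative names, not part of the paper or its code.

```python
# (AP, GFLOPs) pairs copied from the table above, 256 x 192 input.
ROWS = {
    "ResNet-18": (70.7, 2.3),
    "RSN-18": (73.6, 2.5),
    "ResNet-50": (72.2, 4.6),
    "RSN-50": (74.7, 6.4),
    "4 x ResNet-50": (76.8, 20.6),
    "4 x RSN-50": (78.6, 27.5),
}


def compare(baseline, candidate):
    """Return (AP gain, GFLOPs ratio) of candidate relative to baseline."""
    ap_base, gflops_base = ROWS[baseline]
    ap_cand, gflops_cand = ROWS[candidate]
    return round(ap_cand - ap_base, 1), round(gflops_cand / gflops_base, 2)


if __name__ == "__main__":
    for base, cand in [("ResNet-18", "RSN-18"),
                       ("ResNet-50", "RSN-50"),
                       ("4 x ResNet-50", "4 x RSN-50")]:
        gain, ratio = compare(base, cand)
        print(f"{cand}: +{gain} AP at {ratio}x the GFLOPs of {base}")
```

On these rows the AP gains are 2.9, 2.5, and 1.8, while the compute ratio grows from about 1.09x at depth 18 to about 1.39x at depth 50, so the gain is not obtained at identical cost.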

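RSN consistently improves AP over ResNet and Res2Net at comparable GFLOPs (e.g., RSN-18 gains +2.9 AP over ResNet-18, and RSN-50 gains +2.5 AP over ResNet-50).
RSN stays efficient as model capacity grows, keeping its AP above DenseNet and Res2Net at larger GFLOPs.
PRM improves both single-stage and multi-stage networks, yielding AP gains over baselines without attention (e.g., +1.5 AP for ResNet-18 with PRM).
On COCO test-dev, a single 4 × RSN-50 model reaches 78.0 AP (the abstract reports 78.6 for the best single model), and an RSN-50 ensemble reaches 79.2 AP at 384 × 288 input, with no pretrained backbone required.
On MPII, RSN achieves a state-of-the-art mean PCKh@0.5 of 93.0% with 4 × RSN-50.
RSN matches HRNet's accuracy with faster inference: it delivers higher PPS on GPU and also performs better on CPU.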
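
For readers unfamiliar with the MPII metric quoted above, PCKh@0.5 counts a predicted keypoint as correct when it lies within 0.5 times the person's head segment length of the ground truth. The sketch below is a simplified illustration, not the official MPII evaluation: it pools all annotated joints into one percentage, whereas the benchmark reports per-joint scores and their mean. The function name and data layout are assumptions.

```python
import math


def pckh(pred, gt, head_sizes, threshold=0.5):
    """Percentage of annotated keypoints within threshold * head size.

    pred, gt: one list of (x, y) tuples per person; a ground-truth entry
    of None marks an unannotated joint and is skipped.
    head_sizes: head segment length per person, in pixels.
    """
    correct = 0
    total = 0
    for pred_kps, gt_kps, head in zip(pred, gt, head_sizes):
        for p, g in zip(pred_kps, gt_kps):
            if g is None:
                continue
            total += 1
            distance = math.hypot(p[0] - g[0], p[1] - g[1])
            if distance <= threshold * head:
                correct += 1
    return 100.0 * correct / total if total else 0.0


if __name__ == "__main__":
    gt = [[(0, 0), (10, 10), None]]
    pred = [[(3, 4), (10, 16), (0, 0)]]
    print(pckh(pred, gt, head_sizes=[10.0]))  # 50.0
```

In the usage example the first joint is off by 5 pixels (exactly at the 0.5 × 10 limit, so it counts), the second is off by 6 pixels (a miss), and the third is ignored because it has no annotation.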

This review was generated by AI and checked by a human editor.