QUICK REVIEW

[Paper Review] MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

Grégory Rogez, Cordelia Schmid|arXiv (Cornell University)|Jul 7, 2016

Human Pose and Action Recognition | References: 43 | Citations: 194

One-line summary

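This work introduces an image-based synthesis engine that uses MoCap data to augment real images carrying 2D pose annotations with 3D pose annotations, and uses the result to train a K-way CNN classifier for full-body 3D pose estimation. The method outperforms prior approaches on a controlled dataset and shows promise on in-the-wild data.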

ABSTRACT

This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D Motion Capture (MoCap) data. Given a candidate 3D pose our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms the state of the art in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for in-the-wild images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images.

Motivation and objectives

Address the lack of large-scale training data for 3D human pose estimation in the wild.
Develop an image-based synthesis engine that fuses MoCap 3D poses with real 2D-pose images to create synthetic training data with 3D annotations.
Train an end-to-end CNN to perform 3D pose estimation as a K-way pose classification problem.
Demonstrate that CNNs trained on synthetic and real data generalize to real in-the-wild images and outperform prior methods on controlled datasets.

Proposed method

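Use MoCap-guided mosaic construction to select joint-centered patches from real images and combine them into synthetic images with 2D pose.
Define a pose-aware distance D_j between 2D poses and use it to find, for each joint, the match that best agrees with the candidate 3D pose.
Build the per-pixel assignment of source joints as probability maps and apply a kinematically constrained mosaic to render 220x220 synthetic images.
Apply a novel pose-aware blending step that smooths seams while preserving the torso region.
Train an end-to-end CNN classifier (based on an AlexNet-style architecture): cluster the 3D poses into K=5000 pose classes and output a probability distribution over those classes. Absolute position and orientation recognition are evaluated after class prediction.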
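
The per-joint matching and the K-way pose classes described above can be illustrated with a small NumPy sketch. The review does not give the formula for D_j, the skeleton, or the clustering algorithm, so everything concrete here is an assumption made for illustration: the toy 5-joint chain in `NEIGHBORS`, the translation-invariant local distance standing in for D_j, plain Lloyd's k-means for the clustering, and all function names.

```python
import numpy as np

# Hypothetical kinematic neighbours for a toy 5-joint chain (not the paper's skeleton).
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def local_pose_distance(query_2d, candidate_2d, joint):
    """Assumed stand-in for D_j: compare joint-centred offsets to neighbouring joints."""
    nb = NEIGHBORS[joint]
    q = query_2d[nb] - query_2d[joint]
    c = candidate_2d[nb] - candidate_2d[joint]
    return float(np.linalg.norm(q - c))

def match_joints(projected_2d, library_2d):
    """For each joint, return the index of the library pose that matches best locally."""
    n_joints = projected_2d.shape[0]
    return [
        int(np.argmin([local_pose_distance(projected_2d, cand, j) for cand in library_2d]))
        for j in range(n_joints)
    ]

def kmeans_pose_classes(poses_3d, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means over flattened 3D poses; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    x = poses_3d.reshape(len(poses_3d), -1)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):  # keep the old centroid if a cluster empties out
                centroids[c] = members.mean(axis=0)
    # Recompute labels so they agree with the final centroids.
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return centroids.reshape(k, *poses_3d.shape[1:]), labels

def class_probs_to_pose(probs, centroids):
    """Turn a K-way probability vector into a 3D pose via the most likely class centroid."""
    return centroids[int(np.argmax(probs))]
```

In this sketch the cluster labels would serve as classification targets for the CNN, and `class_probs_to_pose` maps the network's output distribution back to a 3D pose. Taking the single most likely centroid is only one possible decoding; the review does not say which one the authors use.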

Experimental results

Research questions

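RQ1: Can MoCap-driven image synthesis produce photorealistic in-the-wild training images with accurate 3D pose annotations?
RQ2: Does a CNN trained on synthetic plus real data improve 3D pose estimation compared with training on real data only or synthetic data only?
RQ3: How much do the number of pose classes (K) and the amount of synthetic data affect 3D pose performance in the wild?
RQ4: How does the proposed method compare with state-of-the-art methods on a controlled dataset (Human3.6M) and an in-the-wild dataset (LSP)?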

Key findings

2D source	3D source	3D poses	H3.6M Abs Error (mm)	H3.6M Error (mm)	LSP 2D Error (pix)	LSP 3D Error (pix)
H3.6M	H3.6M	190,000	130.1	97.2	8.8	31.1
MPII+LSP	H3.6M	190,000	248.9	122.1	17.3	20.7
MPII+LSP	CMU	190,000	320.0	150.6	19.7	22.4
MPII+LSP	CMU	2,000,000	216.5	138.0	11.2	13.8
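
The review does not define the two Human3.6M columns in the table above. The sketch below assumes one common convention: "Abs Error" as the mean per-joint Euclidean distance in the original coordinate frame, and "Error" as the same quantity after a rigid (rotation plus translation) alignment of the prediction to the ground truth. Whether the paper uses exactly this alignment is an assumption, and the function names are hypothetical.

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean Euclidean distance between corresponding joints (same units as the input)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def rigid_align(pred, gt):
    """Kabsch alignment: best rotation + translation (no scaling) taking pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, _, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return p @ rot + mu_g
```

Under this assumption, `mean_joint_error(pred, gt)` would correspond to the absolute figure and `mean_joint_error(rigid_align(pred, gt), gt)` to the aligned one. The aligned value can never exceed the absolute one, which matches the ordering of the two columns in every row of the table.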

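Training on synthetic data alone yields a larger improvement than training on real data alone, and combining synthetic and real data gives the best results.
On Human3.6M protocol P1, a regressor trained on synthetic data achieves 101.9 mm Abs Error and 97.2 mm Error; a classifier trained on synthetic data achieves 97.2 mm Abs Error and 88.1 mm Error; with synthetic and real data combined, the classifier reaches 125.5 mm Abs Error and 88.1 mm Error (accounting for absolute alignment).
On protocol P2, with synthetic and real data, the classifier reaches a 3D error of 87.3 mm (absolute), compared with 121.2 mm for the regressor.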
With MPII+LSP as the 2D source, CMU as the 3D source, and 2 million synthetic images, the LSP 2D pose error drops to 11.2 pixels, while the Human3.6M errors are 216.5 mm (Abs Error) and 138.0 mm (Error); the method is competitive with 2D pose estimation baselines.
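Fine-tuning a VGG-16-style network reduces the 2D pose error by a further 2.3 pixels compared with the AlexNet-based setup.
Qualitative results show correct 3D pose estimates, along with some failure cases caused by unseen poses or by left-right and front-back confusion.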

This review was generated by AI and checked by a human editor.