QUICK REVIEW

[Paper Review] RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video

Jiayi Wang, Franziska Mueller | arXiv (Cornell University) | Jun 22, 2021

Human Pose and Action Recognition | References: 30 | Citations: 42

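One-Line Summary

RGB2Hands presents a method that tracks and reconstructs, in real time, the 3D pose and surface geometry of two interacting hands from a single RGB camera, using a multi-task CNN and a generative hand-model fitting framework. It addresses depth ambiguity and occlusion without a depth sensor.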

ABSTRACT

Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings (e.g., considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Moreover, our method even performs on-par with depth-based real-time methods.

Motivation and Objectives

Address the challenge of marker-less 3D hand motion capture for two closely interacting hands using only monocular RGB input.
Develop a robust, real-time pipeline that estimates global 3D pose and hand shape for both hands.
Explicitly handle depth ambiguities and occlusions in RGB data during two-hand interaction tracking.
Create training data and a benchmark (RGB2Hands) to enable learning-based two-hand RGB reconstruction.

Proposed Method

Propose a multi-task CNN that predicts per-pixel left/right hand segmentation, dense vertex-to-image matchings to a 3D hand model, intra-hand relative depth maps, inter-hand distance maps, and occlusion-robust 2D keypoints.
Fit a parametric 3D hand model (MANO) to both hands by minimizing a composite energy f(β,θ) = Φ(β,θ) + Ω(β,θ), where β denotes the shape parameters and θ the pose parameters.
Φ combines dense 2D fitting, silhouette, 2D keypoint, intra-hand depth, and inter-hand distance terms to align the model with the RGB data.
Introduce intra-hand relative depth and inter-hand distance terms to resolve depth ambiguities from RGB during two-hand interaction.
Use a Levenberg–Marquardt optimization with GPU-accelerated Jacobian evaluation to achieve real-time fitting (up to 10 LM iterations).
Train on a mixed dataset of real (RGB-D) and physically simulated synthetic data that models interacting hands with varying shapes, guided by a MANO-based synthesis pipeline.
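
The fitting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: a toy planar two-link chain stands in for the MANO hand model, a single keypoint-style residual stands in for Φ, and Ω is assumed here to act as a simple prior that pulls the pose toward zero (the summary does not spell out what Ω contains). The Jacobian is computed by finite differences on the CPU rather than on the GPU. All function names and the values of reg_weight and damping are hypothetical; only the cap of 10 Levenberg-Marquardt iterations comes from the text.

```python
import math

def forward_keypoints(theta):
    """Toy planar 2-link chain: both joint positions, flattened to [x1, y1, x2, y2]."""
    a, b = theta
    x1, y1 = math.cos(a), math.sin(a)
    x2, y2 = x1 + math.cos(a + b), y1 + math.sin(a + b)
    return [x1, y1, x2, y2]

def residuals(theta, target, reg_weight=1e-2):
    """Stack data residuals (stand-in for Phi) and prior residuals (stand-in for Omega)."""
    data = [p - t for p, t in zip(forward_keypoints(theta), target)]
    prior = [reg_weight * t for t in theta]  # assumed prior: pull the pose toward zero
    return data + prior

def cost(theta, target):
    """Half the sum of squared residuals, i.e. the toy energy being minimized."""
    return 0.5 * sum(r * r for r in residuals(theta, target))

def numeric_jacobian(theta, target, eps=1e-6):
    """Forward-difference Jacobian, stored as one column per parameter."""
    r0 = residuals(theta, target)
    cols = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += eps
        ri = residuals(shifted, target)
        cols.append([(u - v) / eps for u, v in zip(ri, r0)])
    return r0, cols

def levenberg_marquardt(target, theta0, max_iters=10, damping=1e-2):
    """Damped Gauss-Newton loop for two parameters with a fixed iteration budget."""
    theta = list(theta0)
    best = cost(theta, target)
    for _ in range(max_iters):
        r, (ja, jb) = numeric_jacobian(theta, target)
        # Damped normal equations: (J^T J + damping * I) delta = -J^T r.
        h11 = sum(u * u for u in ja) + damping
        h22 = sum(u * u for u in jb) + damping
        h12 = sum(u * v for u, v in zip(ja, jb))
        g1 = -sum(u * v for u, v in zip(ja, r))
        g2 = -sum(u * v for u, v in zip(jb, r))
        det = h11 * h22 - h12 * h12
        delta = [(h22 * g1 - h12 * g2) / det, (h11 * g2 - h12 * g1) / det]
        candidate = [t + d for t, d in zip(theta, delta)]
        new_cost = cost(candidate, target)
        if new_cost < best:
            theta, best = candidate, new_cost
            damping *= 0.5   # step accepted: trust the quadratic model more
        else:
            damping *= 10.0  # step rejected: move toward gradient descent
    return theta, best

# Synthesize a target from a known pose, then recover it from a zero initialization.
true_theta = [0.6, -0.4]
target = forward_keypoints(true_theta)
theta_hat, final_cost = levenberg_marquardt(target, [0.0, 0.0])
print(theta_hat, final_cost)
```

In this sketch each additional cue would simply append more entries to the residual vector, which is one common way to combine several fitting terms in a single least-squares problem. Whether the paper weights or structures its terms this way is not stated in the summary.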

Experimental Results

Research Questions

RQ1: Can a monocular RGB pipeline reconstruct accurate 3D pose and surface geometry for two closely interacting hands in real time?
RQ2: How can depth ambiguities in RGB be mitigated when tracking two hands in contact or near-contact scenarios?
RQ3: Does a multi-task CNN predicting segmentation, dense matching, depth cues, and keypoints provide robust targets for two-hand model fitting?
RQ4: How does RGB2Hands perform relative to depth-based methods and RGB methods not designed for two-hand interaction?

Key Findings

The method reconstructs 3D pose and shape for two interacting hands from monocular RGB in real time.
A multi-task CNN predicting segmentation, dense surface matching, intra-hand depth, inter-hand distance, and 2D keypoints enables robust two-hand coupling in the fitting stage.
A new energy formulation with five image-fitting terms (dense, silhouette, keypoints, intra-depth, inter-hand distance) enables coherent 3D fits from RGB data.
A synthetic+real training regime with a physically accurate hand-pair simulator improves optimization toward realistic two-hand poses.
RGB2Hands achieves substantial improvements over RGB-based methods not designed for two-hand interactions and performs comparably to depth-based real-time methods.
A new RGB2Hands benchmark dataset provides real two-hand sequences with manual keypoints and synchronized depth for 3D evaluation.

This review was generated by AI and checked by a human editor.