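[Paper Review] RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video
RGB2Hands presents a method that tracks and reconstructs the 3D pose and surface geometry of two interacting hands in real time from a single RGB camera, using a multi-task CNN and a generative hand-model fitting framework. It addresses depth ambiguity and occlusion without a depth sensor.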
Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings (e.g., considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Moreover, our method even performs on-par with depth-based real-time methods.
Motivation and Objectives
- Address the challenge of marker-less 3D hand motion capture for two closely interacting hands using only monocular RGB input.
- Develop a robust, real-time pipeline that estimates global 3D pose and hand shape for both hands.
- Explicitly handle depth ambiguities and occlusions in RGB data during two-hand interaction tracking.
- Create training data and a benchmark (RGB2Hands) to enable learning-based two-hand RGB reconstruction.
Proposed Method
- Propose a multi-task CNN that predicts per-pixel left/right hand segmentation, dense matchings from image pixels to a 3D hand model, intra-hand relative depth maps, inter-hand distance maps, and occlusion-robust 2D keypoints.
- Fit a parametric 3D hand model (MANO) for both hands by minimizing a composite energy f(β,θ) = Φ(β,θ) + Ω(β,θ).
- Φ combines dense 2D fitting, silhouette, 2D keypoint, intra-hand depth, and inter-hand distance terms to align the model with the RGB data.
- Introduce intra-hand relative depth and inter-hand distance terms to resolve depth ambiguities from RGB during two-hand interaction.
- Use Levenberg–Marquardt (LM) optimization with GPU-accelerated Jacobian evaluation to achieve real-time fitting (up to 10 LM iterations).
- Train on a mixed dataset of real (RGB-D) and physically simulated synthetic data that models interacting hands with varying shapes, guided by a MANO-based synthesis pipeline.
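The fitting step above can be illustrated with a minimal sketch, assuming a toy problem rather than MANO. A single 2D "bone" with length beta and angle theta stands in for the hand model, a synthetic keypoint residual stands in for Φ, and a simple length prior stands in for Ω. Treating Ω as a regularizer is an assumption of this sketch, since the summary does not define it. All names, weights, and the finite-difference Jacobian are hypothetical; the paper evaluates Jacobians on the GPU.

```python
import numpy as np

# Hypothetical toy "hand": one bone of length beta rotated by angle theta in 2D.
TARGET_TIP = 1.2 * np.array([np.cos(0.5), np.sin(0.5)])  # synthetic 2D keypoint
MEAN_LENGTH = 1.0    # assumed shape prior (not from the paper)
PRIOR_WEIGHT = 0.03  # assumed regularizer weight (not from the paper)


def residuals(x):
    """Stack data residuals (stand-in for Phi) and prior residuals (stand-in for Omega)."""
    beta, theta = x
    tip = beta * np.array([np.cos(theta), np.sin(theta)])
    data_term = tip - TARGET_TIP
    prior_term = PRIOR_WEIGHT * (beta - MEAN_LENGTH)
    return np.concatenate([data_term, [prior_term]])


def levenberg_marquardt(residual_fn, x0, n_iters=10, damping=1e-2, eps=1e-6):
    """Minimise the sum of squared residuals with a damped Gauss-Newton loop."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        r = residual_fn(x)
        # Forward-difference Jacobian, one column per parameter.
        J = np.empty((r.size, x.size))
        for j in range(x.size):
            step = np.zeros_like(x)
            step[j] = eps
            J[:, j] = (residual_fn(x + step) - r) / eps
        # Damped normal equations: (J^T J + lambda I) delta = -J^T r
        H = J.T @ J + damping * np.eye(x.size)
        delta = np.linalg.solve(H, -J.T @ r)
        x_new = x + delta
        if np.sum(residual_fn(x_new) ** 2) < np.sum(r ** 2):
            x, damping = x_new, damping * 0.5   # accept step, trust the model more
        else:
            damping *= 10.0                     # reject step, fall back toward gradient descent
    return x


x_hat = levenberg_marquardt(residuals, [1.0, 0.0])
print("estimated (beta, theta):", x_hat)
```

The iteration cap of 10 mirrors the budget quoted above; the damping schedule is a common textbook choice, not one taken from the paper.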
Experimental Results
Research Questions
- RQ1: Can a monocular RGB pipeline reconstruct accurate 3D pose and surface geometry for two closely interacting hands in real time?
- RQ2: How can depth ambiguities in RGB be mitigated when tracking two hands in contact or near-contact scenarios?
- RQ3: Does a multi-task CNN predicting segmentation, dense matching, depth cues, and keypoints provide robust targets for two-hand model fitting?
- RQ4: How does RGB2Hands perform relative to depth-based methods and RGB methods not designed for two-hand interaction?
Key Findings
- The method reconstructs 3D pose and shape for two interacting hands from monocular RGB in real time.
- A multi-task CNN predicting segmentation, dense surface matching, intra-hand depth, inter-hand distance, and 2D keypoints enables robust two-hand coupling in the fitting stage.
- A new energy formulation with five image-fitting terms (dense, silhouette, keypoints, intra-depth, inter-hand distance) enables coherent 3D fits from RGB data.
- A synthetic+real training regime with a physically accurate hand-pair simulator improves optimization toward realistic two-hand poses.
- RGB2Hands achieves substantial improvements over RGB-based methods not designed for two-hand interactions and performs comparably to depth-based real-time methods.
- A new RGB2Hands benchmark dataset provides real two-hand sequences with manual keypoints and synchronized depth for 3D evaluation.
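To see why a relative-depth cue helps alongside 2D keypoints, consider a simplified illustration that assumes orthographic projection and a known bone length (both are assumptions of this sketch, not the paper's formulation). The projected length of a bone fixes the magnitude of its depth offset but not the sign, so the fingertip could point toward or away from the camera. A predicted relative depth, here a hypothetical scalar, selects between the two candidates.

```python
import math


def depth_offset_candidates(bone_length, projected_length):
    """Return the two depth offsets consistent with a 2D projection (orthographic assumption)."""
    dz = math.sqrt(max(bone_length ** 2 - projected_length ** 2, 0.0))
    return (-dz, dz)


def pick_with_relative_depth(candidates, predicted_relative_depth):
    """Choose the candidate closest to a (hypothetical) predicted relative depth."""
    return min(candidates, key=lambda dz: abs(dz - predicted_relative_depth))


candidates = depth_offset_candidates(bone_length=5.0, projected_length=3.0)
chosen = pick_with_relative_depth(candidates, predicted_relative_depth=3.2)
print(candidates, chosen)
```

In the method summarized above, this kind of cue enters the energy as a residual term rather than as a hard selection; the hard selection is used here only to keep the example short.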
This review was generated by AI and checked by a human editor.