QUICK REVIEW

[Paper Review] RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video

Jiayi Wang, Franziska Mueller|arXiv (Cornell University)|Jun 22, 2021
Human Pose and Action Recognition|References: 30|Citations: 42
One-Sentence Summary

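RGB2Hands presents a method that tracks and reconstructs the 3D pose and surface geometry of two interacting hands in real time from a single RGB camera, using a multi-task CNN and a generative hand-model fitting framework. It addresses depth ambiguity and occlusion without a depth sensor.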

ABSTRACT

Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings (e.g., considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Moreover, our method even performs on-par with depth-based real-time methods.

Motivation and Objectives

  • Address the challenge of marker-less 3D hand motion capture for two closely interacting hands using only monocular RGB input.
  • Develop a robust, real-time pipeline that estimates global 3D pose and hand shape for both hands.
  • Explicitly handle depth ambiguities and occlusions in RGB data during two-hand interaction tracking.
  • Create training data and a benchmark (RGB2Hands) to enable learning-based two-hand RGB reconstruction.

Proposed Method

  • Propose a multi-task CNN that predicts per-pixel left/right hand segmentation, dense vertex-to-image matchings to a 3D hand model, intra-hand relative depth maps, inter-hand distance, and occlusion-robust 2D keypoints.
  • Fit a parametric 3D hand model (MANO) for both hands by minimizing a composite energy f(β,θ) = Φ(β,θ) + Ω(β,θ).
  • Φ combines dense 2D fitting, silhouette, 2D keypoints, intra-hand depth, and inter-hand distance terms to align model to RGB data.
  • Introduce intra-hand relative depth and inter-hand distance terms to resolve depth ambiguities from RGB during two-hand interaction.
  • Use Levenberg–Marquardt (LM) optimization with GPU-accelerated Jacobian evaluation to achieve real-time fitting (up to 10 LM iterations).
  • Train on a mixed dataset of real (RGB-D) and physically simulated synthetic data that models interacting hands with varying shapes, guided by a MANO-based synthesis pipeline.
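
The Levenberg–Marquardt fitting step listed above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the paper's implementation: it fits a 2D similarity transform to 21 synthetic keypoints instead of MANO pose and shape parameters, it uses a CPU finite-difference Jacobian where the paper evaluates Jacobians on the GPU, and the names `residuals`, `numeric_jacobian`, and `levenberg_marquardt`, along with the damping schedule, are hypothetical. Only the cap of 10 iterations comes from the review.

```python
import numpy as np

def residuals(params, model_pts, target_pts):
    """Stacked 2D keypoint residuals for a toy similarity transform."""
    scale, angle, tx, ty = params
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    projected = scale * model_pts @ rot.T + np.array([tx, ty])
    return (projected - target_pts).ravel()

def numeric_jacobian(fn, params, eps=1e-6):
    """Forward-difference Jacobian (assumption: stands in for an analytic one)."""
    r0 = fn(params)
    jac = np.empty((r0.size, params.size))
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        jac[:, k] = (fn(params + step) - r0) / eps
    return jac

def levenberg_marquardt(fn, params, max_iters=10, damping=1e-3):
    """Minimize sum(fn(params)**2) with a fixed budget of damped Gauss-Newton steps."""
    params = np.asarray(params, dtype=float)
    energy = np.sum(fn(params) ** 2)
    for _ in range(max_iters):
        r = fn(params)
        jac = numeric_jacobian(fn, params)
        # Damped normal equations: (J^T J + damping * I) step = -J^T r
        lhs = jac.T @ jac + damping * np.eye(params.size)
        step = np.linalg.solve(lhs, -jac.T @ r)
        candidate = params + step
        candidate_energy = np.sum(fn(candidate) ** 2)
        if candidate_energy < energy:
            # Accept the step and trust the quadratic model more.
            params, energy = candidate, candidate_energy
            damping *= 0.5
        else:
            # Reject the step and move toward gradient descent.
            damping *= 10.0
    return params, energy

# Toy usage: recover a known transform from 21 synthetic keypoints.
rng = np.random.default_rng(0)
model_pts = rng.uniform(-1.0, 1.0, size=(21, 2))
true_params = np.array([1.2, 0.3, 0.5, -0.2])
target_pts = residuals(true_params, model_pts, np.zeros((21, 2))).reshape(21, 2)

fitted, final_energy = levenberg_marquardt(
    lambda p: residuals(p, model_pts, target_pts),
    np.array([1.0, 0.0, 0.0, 0.0]),
)
print(fitted, final_energy)
```

In a pipeline like the one described, the residual vector would instead stack the image-fitting and regularization terms for both hands, and a fixed iteration budget keeps the per-frame cost bounded.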

Experimental Results

Research Questions

  • RQ1: Can a monocular RGB pipeline reconstruct accurate 3D pose and surface geometry for two closely interacting hands in real time?
  • RQ2: How can depth ambiguities in RGB be mitigated when tracking two hands in contact or near-contact scenarios?
  • RQ3: Does a multi-task CNN predicting segmentation, dense matching, depth cues, and keypoints provide robust targets for two-hand model fitting?
  • RQ4: How does RGB2Hands perform relative to depth-based methods and RGB methods not designed for two-hand interaction?

Key Findings

  • The method reconstructs 3D pose and shape for two interacting hands from monocular RGB in real time.
  • A multi-task CNN predicting segmentation, dense surface matching, intra-hand depth, inter-hand distance, and 2D keypoints enables robust two-hand coupling in the fitting stage.
  • A new energy formulation with five image-fitting terms (dense, silhouette, keypoints, intra-depth, inter-hand distance) enables coherent 3D fits from RGB data.
  • A synthetic+real training regime with a physically accurate hand-pair simulator improves optimization toward realistic two-hand poses.
  • RGB2Hands achieves substantial improvements over RGB-based methods not designed for two-hand interactions and performs comparably to depth-based real-time methods.
  • A new RGB2Hands benchmark dataset provides real two-hand sequences with manual keypoints and synchronized depth for 3D evaluation.
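
As a reading aid, the five-term energy named in the findings can be written out as a weighted sum. The first line restates the composite energy given in the method summary. The second line is only a sketch: the term symbols $E_{\cdot}$ and the weights $\lambda_{\cdot}$ are illustrative notation assumed here, not taken from the paper, and $\Omega$ is left unexpanded because the review does not define it.

```latex
\begin{align}
f(\beta,\theta) &= \Phi(\beta,\theta) + \Omega(\beta,\theta), \\
\Phi(\beta,\theta) &= \lambda_{\text{dense}} E_{\text{dense}}
  + \lambda_{\text{sil}} E_{\text{sil}}
  + \lambda_{\text{key}} E_{\text{key}}
  + \lambda_{\text{intra}} E_{\text{intra}}
  + \lambda_{\text{inter}} E_{\text{inter}}.
\end{align}
```

Here the five summands correspond, in order, to the dense 2D fitting, silhouette, 2D keypoint, intra-hand relative depth, and inter-hand distance terms.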

This review was generated by AI and checked by a human editor.