QUICK REVIEW

[Paper Review] MediaPipe Hands: On-device Real-time Hand Tracking

Fan Zhang, Valentin Bazarevsky | arXiv (Cornell University) | Jun 18, 2020
Hand Gesture Recognition Systems | 10 references | 543 citations
TL;DR

Presents a real-time, on-device two-stage hand tracking pipeline (palm detector + hand landmark model) that predicts 21 2.5D hand landmarks from RGB input and runs efficiently on mobile GPUs. Open-sourced via MediaPipe for cross-platform deployment.

ABSTRACT

We present a real-time on-device hand tracking pipeline that predicts hand skeleton from single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark model. It's implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrates real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.

Motivation & Objective

  • Enable natural interaction in AR/VR applications through real-time hand tracking on commodity devices.
  • Develop a two-stage pipeline to detect palms and predict 21 2.5D hand landmarks from RGB input.
  • Achieve real-time mobile GPU inference with high prediction quality and cross-platform availability.

Proposed method

  • Two-stage pipeline: a palm detector (BlazePalm) provides a bounding box for each hand, and a hand landmark model then regresses 21 2.5D landmarks within the cropped hand region.
  • The palm detector is designed for real-time mobile use: it relies on square bounding boxes, an encoder-decoder feature extractor, and focal loss to handle large scale variance.
  • Hand landmark model outputs: 21 landmarks (x, y, relative depth), a hand presence flag, and a handedness classification (left/right).
  • Tracking uses previous frame landmarks to crop the current frame, triggering the detector only when hands are lost or alignment confidence is low.
  • An auxiliary “hand presence” score helps recover from tracking failures by reinitializing the detector as needed.
  • Implemented within MediaPipe as a graph of modular Calculators with GPU acceleration and TensorFlow Lite backend.
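
The detect-then-track control flow described in the bullets above can be sketched in a few lines. This is a minimal single-hand illustration under stated assumptions, not the MediaPipe implementation: `HandTracker`, `box_from_landmarks`, the 0.5 presence threshold, the 25% crop margin, and the two stand-in model functions are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized
Landmark = Tuple[float, float, float]    # (x, y, relative depth)


@dataclass
class LandmarkResult:
    landmarks: List[Landmark]  # 21 points in the reviewed model
    presence: float            # hand presence score in [0, 1]


def box_from_landmarks(landmarks: Sequence[Landmark], margin: float = 0.25) -> Box:
    """Square crop around the landmarks, enlarged by a hypothetical margin."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 * (1 + margin)
    return (cx - half, cy - half, cx + half, cy + half)


class HandTracker:
    """Runs the (expensive) detector only when no hand is currently tracked."""

    def __init__(
        self,
        detect_palm: Callable[[object], Optional[Box]],
        predict_landmarks: Callable[[object, Box], LandmarkResult],
        presence_threshold: float = 0.5,  # assumed value, not from the paper
    ) -> None:
        self.detect_palm = detect_palm
        self.predict_landmarks = predict_landmarks
        self.presence_threshold = presence_threshold
        self.prev_box: Optional[Box] = None
        self.detector_calls = 0

    def process(self, frame: object) -> Optional[LandmarkResult]:
        if self.prev_box is None:
            # No crop carried over from the previous frame: run the detector.
            self.detector_calls += 1
            self.prev_box = self.detect_palm(frame)
            if self.prev_box is None:
                return None
        result = self.predict_landmarks(frame, self.prev_box)
        if result.presence < self.presence_threshold:
            # Tracking lost: drop the crop so the next frame re-runs the detector.
            self.prev_box = None
            return None
        # Derive the next frame's crop from the current landmarks.
        self.prev_box = box_from_landmarks(result.landmarks)
        return result


# Stand-in models: each "frame" is simply the presence score to report.
def fake_detector(frame: object) -> Optional[Box]:
    return (0.25, 0.25, 0.75, 0.75)


def fake_landmark_model(frame: float, box: Box) -> LandmarkResult:
    points = [(0.4 + 0.01 * i, 0.4 + 0.01 * i, 0.0) for i in range(21)]
    return LandmarkResult(landmarks=points, presence=frame)


tracker = HandTracker(fake_detector, fake_landmark_model)
outputs = [tracker.process(score) for score in (0.9, 0.9, 0.1, 0.9)]
print(tracker.detector_calls)  # 2: once at start, once after the lost frame
```

In this toy run the detector fires on the first frame and again only after the low-presence third frame, which is the behavior that keeps per-frame cost close to that of the landmark model alone.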

Experimental results

Research questions

  • RQ1: Can a two-stage on-device pipeline accurately estimate 21 2.5D hand landmarks from RGB input in real time on mobile devices?
  • RQ2: How does leveraging previous frame landmarks for cropping affect detector frequency and overall throughput?
  • RQ3: What is the impact of training data composition (real, synthetic, combined) on landmark accuracy and temporal stability?
  • RQ4: How does the system perform across different devices (Android, iOS, desktop) and hardware backends?

Key findings

  • The hand landmark model is more accurate when trained on a combination of real-world and synthetic data than on real-world data alone (MSE normalized by palm size: 13.4% combined vs. 16.1% real-world only).
  • Real-time on-device inference is demonstrated on Pixel 3, Samsung S20, and iPhone 11 with three model variants: “Light,” “Full,” and “Heavy.”
  • The “Full” model achieves an MSE of 10.05 with inference times of 16.1 ms on Pixel 3, 11.1 ms on Samsung S20, and 5.3 ms on iPhone 11, balancing quality and speed.
  • Palm detector design choices (square boxes, encoder-decoder feature extractor, focal loss) improve detection robustness under occlusion and scale variance; an ablation study supports their contribution.
  • On-device inference uses TensorFlow Lite GPU backend, enabling real-time performance across platforms.
  • The pipeline outputs 21 landmarks, a hand presence probability, and handedness, enabling downstream AR/gesture applications.
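
As an illustration of how the 21 landmarks might feed a downstream gesture application, the sketch below flags which fingers are extended. The heuristic (a fingertip lying farther from the wrist than the same finger's PIP joint) is an assumption made for this example and does not come from the paper; the landmark indices follow the numbering used in the MediaPipe Hands documentation (0 = wrist, 8/12/16/20 = fingertips). The thumb is omitted because this simple rule is unreliable for it.

```python
import math
from typing import List, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, relative depth)

# Indices per the MediaPipe Hands landmark numbering.
WRIST = 0
FINGER_TIPS = {"index": 8, "middle": 12, "ring": 16, "pinky": 20}
FINGER_PIPS = {"index": 6, "middle": 10, "ring": 14, "pinky": 18}


def _dist2d(a: Landmark, b: Landmark) -> float:
    """Euclidean distance in the image plane (depth is ignored)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def extended_fingers(landmarks: Sequence[Landmark]) -> List[str]:
    """Names of fingers whose tip is farther from the wrist than its PIP joint.

    A rough, orientation-independent heuristic for illustration only.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    wrist = landmarks[WRIST]
    return [
        name
        for name, tip in FINGER_TIPS.items()
        if _dist2d(landmarks[tip], wrist) > _dist2d(landmarks[FINGER_PIPS[name]], wrist)
    ]


# Synthetic hand: wrist at the bottom, index finger stretched, middle finger curled.
points: List[Landmark] = [(0.5, 0.5, 0.0)] * 21
points[WRIST] = (0.5, 0.9, 0.0)
points[8] = (0.5, 0.3, 0.0)   # index tip far from the wrist
points[12] = (0.5, 0.7, 0.0)  # middle tip folded back toward the wrist
print(extended_fingers(points))  # ['index']
```

A production gesture recognizer would likely also use the handedness output and the relative depth values, both of which this sketch ignores.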

This review was created by AI and reviewed by human editors.