QUICK REVIEW

[Paper Review] MediaPipe Hands: On-device Real-time Hand Tracking

Fan Zhang, Valentin Bazarevsky | arXiv (Cornell University) | Jun 18, 2020

Hand Gesture Recognition Systems | 10 references | 543 citations

TL;DR

Presents a real-time, on-device two-stage hand tracking pipeline (palm detector + hand landmark model) that predicts 21 2.5D hand landmarks from RGB input and runs efficiently on mobile GPUs. Open-sourced via MediaPipe for cross-platform deployment.

ABSTRACT

We present a real-time on-device hand tracking pipeline that predicts hand skeleton from single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark model. It's implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrates real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.

Motivation & Objective

Enable natural interaction in AR/VR applications through real-time hand tracking on commodity devices.
Develop a two-stage pipeline to detect palms and predict 21 2.5D hand landmarks from RGB input.
Achieve real-time mobile GPU inference with high prediction quality and cross-platform availability.

Proposed method

Two-stage pipeline: a BlazePalm-like palm detector provides a bounding box for each hand, and a hand landmark model then regresses 21 2.5D landmarks within the cropped hand region.
The palm detector is designed for real-time mobile use: it models palms with square bounding boxes, uses an encoder-decoder feature extractor, and trains with focal loss to cope with large scale variance.
Hand landmark model outputs: 21 landmarks (x, y, relative depth), a hand presence flag, and a handedness classification (left/right).
Tracking derives the current frame's crop from the previous frame's landmarks, so the detector runs only on the first frame or when the hand is lost.
The landmark model's “hand presence” score signals tracking failure: when it falls below a threshold, the detector is re-run to reinitialize tracking.
Implemented within MediaPipe as a graph of modular Calculators with GPU acceleration and TensorFlow Lite backend.
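The detector-on-demand control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the MediaPipe implementation: `HandPrediction`, `HandTracker`, `box_from_landmarks`, the axis-aligned box layout, the 0.25 margin, and the 0.5 presence threshold are all hypothetical choices made for the example, and the two models are passed in as plain callables. Rotation of the crop and multi-hand handling are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Landmark = Tuple[float, float, float]    # (x, y, relative depth)
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max); assumed layout


@dataclass
class HandPrediction:
    landmarks: List[Landmark]  # 21 entries in the real model
    presence: float            # probability that a hand is in the crop
    handedness: str            # "left" or "right"


def box_from_landmarks(landmarks: List[Landmark], margin: float = 0.25) -> Box:
    """Square box around the landmarks, enlarged by `margin` (assumed value)."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 * (1 + margin)
    return (cx - half, cy - half, cx + half, cy + half)


class HandTracker:
    """Runs the palm detector only when no valid crop is available."""

    def __init__(
        self,
        detect_palm: Callable[[object], Optional[Box]],
        predict_landmarks: Callable[[object, Box], HandPrediction],
        presence_threshold: float = 0.5,  # assumed threshold
    ) -> None:
        self.detect_palm = detect_palm
        self.predict_landmarks = predict_landmarks
        self.presence_threshold = presence_threshold
        self.box: Optional[Box] = None
        self.detector_calls = 0

    def process(self, frame: object) -> Optional[HandPrediction]:
        # First frame, or the hand was lost earlier: run the detector.
        if self.box is None:
            self.detector_calls += 1
            self.box = self.detect_palm(frame)
            if self.box is None:
                return None
        pred = self.predict_landmarks(frame, self.box)
        # Low presence score: drop the crop so the detector runs next frame.
        if pred.presence < self.presence_threshold:
            self.box = None
            return None
        # Otherwise reuse this frame's landmarks as the next frame's crop.
        self.box = box_from_landmarks(pred.landmarks)
        return pred
```

With stub callables standing in for the two models, `detector_calls` makes it easy to see how rarely the detector runs while the presence score stays above the threshold.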

Experimental results

Research questions

RQ1: Can a two-stage on-device pipeline accurately estimate 21 2.5D hand landmarks from RGB input in real time on mobile devices?
RQ2: How does leveraging previous frame landmarks for cropping affect detector frequency and overall throughput?
RQ3: What is the impact of training data composition (real, synthetic, combined) on landmark accuracy and temporal stability?
RQ4: How does the system perform across different devices (Android, iOS, desktop) and hardware backends?

Key findings

The hand landmark model is most accurate when trained on a combination of real-world and synthetic data: MSE normalized by palm size is 13.4% for the combined set versus 16.1% for real-world data only.
Real-time on-device inference is demonstrated on the Pixel 3, Samsung S20, and iPhone 11 with three model variants: “Light,” “Full,” and “Heavy.”
The “Full” model reaches an MSE of 10.05 with inference times of 16.1 ms on the Pixel 3, 11.1 ms on the Samsung S20, and 5.3 ms on the iPhone 11, balancing quality and speed.
An ablation study indicates that the palm detector's design choices (square boxes, encoder-decoder feature extractor, focal loss) improve detection robustness under occlusion and scale variance.
On-device inference uses TensorFlow Lite GPU backend, enabling real-time performance across platforms.
The pipeline outputs 21 landmarks, a hand presence probability, and handedness, enabling downstream AR/gesture applications.
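The focal loss credited in the palm detector finding above is not spelled out in this review. As a rough illustration, the standard binary form from Lin et al., FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), can be sketched as follows. The defaults alpha = 0.25 and gamma = 2 are the commonly cited values from that work and are an assumption here; the settings used for the palm detector are not given in this review.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one anchor.

    p: predicted probability of the positive class (palm).
    y: ground-truth label, 1 for palm and 0 for background.
    alpha, gamma: assumed defaults, not taken from the reviewed paper.
    """
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)           # keep log() finite
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified anchors.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p: float, y: int) -> float:
    """Plain binary cross-entropy, for comparison."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)
```

Compared with cross-entropy, the modulating factor down-weights the many easy background anchors, which is the usual motivation for using it when a detector has a large number of anchors.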

This review was created by AI and reviewed by human editors.