QUICK REVIEW

[Paper Review] YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus

David Uthus, Garrett Tanzer | arXiv (Cornell University) | Jun 27, 2023

Hand Gesture Recognition Systems | 9 citations

TL;DR

Introduces YouTube-ASL, a large open-domain ASL-English parallel corpus mined from YouTube, and uses it to set a new finetuned state of the art for ASL-to-English translation on How2Sign, alongside the first reported zero-shot results on that benchmark.

ABSTRACT

Machine learning for sign languages is bottlenecked by data. In this paper, we present YouTube-ASL, a large-scale, open-domain corpus of American Sign Language (ASL) videos and accompanying English captions drawn from YouTube. With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as large and has ~10x as many unique signers as the largest prior ASL dataset. We train baseline models for ASL to English translation on YouTube-ASL and evaluate them on How2Sign, where we achieve a new finetuned state of the art of 12.39 BLEU and, for the first time, report zero-shot results.

Motivation & Objective

Address the data bottleneck in sign language ML by building a large, diverse ASL-English parallel corpus from web video.
Show that open-domain mining, with automatic tagging followed by human filtering, yields well-aligned English captions for ASL videos and a wide range of signers.
Provide baseline ASL-to-English translation results on How2Sign, both finetuned and zero-shot, as a benchmark for future work.

Proposed method

Two-step data collection: automatic tagging of YouTube videos likely to contain ASL, followed by human filtering for caption alignment and quality.
Preprocessing uses MediaPipe Holistic landmarks (hands, face, and a limited set of pose points) as input features; 85 selected landmarks are normalized and the frame sequence is downsampled, giving sequences of 255-dimensional vectors (85 landmarks × 3 coordinates).
The baseline is a Transformer built on the T5 encoder-decoder architecture: landmark frames are embedded and fed to the encoder, which has a 256-frame context window, while the decoder has a 128-token context window.
Four training regimes are compared: How2Sign (H2S) only; YouTube-ASL (YT-ASL) only, evaluated zero-shot on How2Sign; a mixture of YT-ASL and H2S; and YT-ASL training followed by finetuning on H2S.
Evaluation uses BLEU and BLEURT on How2Sign, with beam search (width=5); zero-shot and finetuned performance are reported.
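
The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the per-axis min-max normalization, and the stride of 2 are assumptions made for the example; only the 85 landmarks, the 255-dimensional output, and the 256-frame encoder window come from the review.

```python
from typing import List, Sequence, Tuple

Landmark = Tuple[float, float, float]   # (x, y, z) for one landmark
Frame = Sequence[Landmark]              # 85 landmarks per frame

NUM_LANDMARKS = 85                      # from the review
FEATURE_DIM = NUM_LANDMARKS * 3         # 255-dimensional frame vector
MAX_FRAMES = 256                        # encoder context window from the review


def normalize_clip(frames: Sequence[Frame]) -> List[List[Landmark]]:
    """Rescale each axis to [0, 1] over the whole clip.

    Assumption: per-axis min-max scaling stands in for whatever
    normalization the paper actually applies.
    """
    if not frames:
        return []
    mins = [min(pt[a] for f in frames for pt in f) for a in range(3)]
    maxs = [max(pt[a] for f in frames for pt in f) for a in range(3)]
    # Guard against a zero extent (e.g. a constant coordinate).
    spans = [(hi - lo) or 1.0 for lo, hi in zip(mins, maxs)]
    return [
        [tuple((pt[a] - mins[a]) / spans[a] for a in range(3)) for pt in f]
        for f in frames
    ]


def preprocess(frames: Sequence[Frame], stride: int = 2,
               max_frames: int = MAX_FRAMES) -> List[List[float]]:
    """Downsample in time, truncate to the encoder window, normalize, flatten."""
    for f in frames:
        if len(f) != NUM_LANDMARKS:
            raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(f)}")
    kept = list(frames[::stride])[:max_frames]   # hypothetical stride of 2
    normed = normalize_clip(kept)
    # Flatten each frame: 85 landmarks x 3 coordinates -> 255 floats.
    return [[c for pt in f for c in pt] for f in normed]
```

Calling `preprocess` on a clip of 10 frames returns 5 vectors of length 255; a clip longer than 512 frames is cut to 256 vectors after downsampling, which matches the encoder window quoted above.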

Experimental results

Research questions

RQ1: Can a large-scale, open-domain ASL-English corpus mined from YouTube improve ASL-to-English translation benchmarks?
RQ2: What is the impact of pretraining on English text and of mixing YouTube-ASL data with How2Sign data on translation quality?
RQ3: How does zero-shot performance on How2Sign compare to finetuned performance when using YouTube-ASL data?
RQ4: Does the YouTube-ASL dataset provide improvements over prior ASL datasets in terms of size and signer diversity?

Key findings

YouTube-ASL comprises 11,093 ASL videos totaling ~984 hours, with 610,193 English captions covering 813 hours, and more than 2,519 unique signers, estimated using channels as a proxy.
Finetuned state-of-the-art on How2Sign: 12.39 BLEU, surpassing prior SOTA of 8.03 BLEU.
Zero-shot BLEU of 3.95 demonstrates nontrivial out-of-domain translation capability.
Training on YouTube-ASL alone (the zero-shot setting) gives the lowest of the YouTube-ASL scores; pretraining on English text and finetuning on How2Sign both raise performance substantially.
Mixed training on YouTube-ASL and How2Sign reaches 11.89 BLEU (36.35 BLEU-1, 23.00 BLEU-2, 16.13 BLEU-3); training on YouTube-ASL and then finetuning on How2Sign gives the best result, 12.39 BLEU.
YouTube-ASL provides substantial signer variety and real-world domain coverage, though translation quality is still well short of deployment-ready.
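
A few derived quantities make the headline numbers easier to read. The sketch below only does arithmetic on figures quoted in this review; the constant and function names are hypothetical, and nothing here is an additional result from the paper.

```python
# Figures quoted in the key findings above.
HOURS_VIDEO = 984          # total footage, hours
HOURS_CAPTIONED = 813      # footage covered by captions, hours
NUM_CAPTIONS = 610_193
NUM_VIDEOS = 11_093
BLEU_FINETUNED = 12.39
BLEU_PRIOR_SOTA = 8.03
BLEU_ZERO_SHOT = 3.95


def mean_caption_seconds(hours: float, captions: int) -> float:
    """Average caption duration in seconds."""
    return hours * 3600 / captions


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


if __name__ == "__main__":
    print(f"mean caption length: {mean_caption_seconds(HOURS_CAPTIONED, NUM_CAPTIONS):.2f} s")
    print(f"captioned share of footage: {HOURS_CAPTIONED / HOURS_VIDEO:.1%}")
    print(f"mean video length: {HOURS_VIDEO * 60 / NUM_VIDEOS:.1f} min")
    print(f"finetuned vs prior SOTA: +{relative_gain(BLEU_FINETUNED, BLEU_PRIOR_SOTA):.1%}")
    print(f"zero-shot as share of finetuned: {BLEU_ZERO_SHOT / BLEU_FINETUNED:.1%}")
```

On these figures the average caption spans roughly 4.8 seconds, captions cover about 83% of the footage, and the finetuned score is about a 54% relative improvement over the prior 8.03 BLEU.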

This review was created by AI and reviewed by human editors.