[Paper Review] YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus
Introduces YouTube-ASL, a large open-domain ASL-English parallel corpus mined from YouTube, and reports both a new finetuned state of the art for ASL-to-English translation on How2Sign and the first zero-shot results on that benchmark.
Machine learning for sign languages is bottlenecked by data. In this paper, we present YouTube-ASL, a large-scale, open-domain corpus of American Sign Language (ASL) videos and accompanying English captions drawn from YouTube. With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as large and has ~10x as many unique signers as the largest prior ASL dataset. We train baseline models for ASL to English translation on YouTube-ASL and evaluate them on How2Sign, where we achieve a new finetuned state of the art of 12.39 BLEU and, for the first time, report zero-shot results.
Motivation & Objective
- Address data bottleneck in sign language ML by creating a large, diverse ASL-English parallel corpus from web data.
- Show that open-domain mining with automatic tagging plus human filtering yields high-quality ASL captions and signer diversity.
- Provide baseline ASL-to-English translation results to establish a benchmark and demonstrate zero-shot capabilities.
Proposed method
- Two-step data collection: automatic tagging of YouTube videos likely to contain ASL, followed by human filtering for caption alignment and quality.
- Preprocessing uses MediaPipe Holistic landmarks (hands, face, limited pose) as input features; 85 selected landmarks are normalized and downsampled, giving sequences of 255-dimensional vectors (85 landmarks × 3 coordinates).
- The baseline is a Transformer model built on the T5 encoder-decoder architecture; the encoder takes embedded landmark features as input, with a 256-frame encoder context window and a 128-token decoder context window.
- Training regimes include: training on How2Sign (H2S) only, on YouTube-ASL (YT-ASL) only (zero-shot on How2Sign), on mixed data (YT-ASL + H2S), and on YouTube-ASL followed by finetuning on How2Sign.
- Evaluation uses BLEU and BLEURT on How2Sign, with beam search (width=5); zero-shot and finetuned performance are reported.
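The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the normalization scheme (centering each frame and scaling by its largest offset), the downsampling stride of 2, the zero-padding, and the names `normalize_frame` and `frames_to_features` are all assumptions made for the example. Only the 85 landmarks, the 255-dimensional vectors, and the 256-frame window come from the review.

```python
import random
from typing import List, Sequence

Frame = List[List[float]]  # one frame: a list of [x, y, z] landmark coordinates

NUM_LANDMARKS = 85  # from the review: 85 selected landmarks
MAX_FRAMES = 256    # from the review: 256-frame encoder context window


def normalize_frame(frame: Frame) -> Frame:
    """Center landmarks on their mean and scale by the largest absolute offset.

    Assumed normalization for illustration; the paper's scheme may differ.
    """
    n = len(frame)
    means = [sum(p[d] for p in frame) / n for d in range(3)]
    centered = [[p[d] - means[d] for d in range(3)] for p in frame]
    # Fall back to 1.0 if every landmark coincides, to avoid dividing by zero.
    scale = max(abs(v) for p in centered for v in p) or 1.0
    return [[v / scale for v in p] for p in centered]


def frames_to_features(frames: Sequence[Frame], stride: int = 2) -> List[List[float]]:
    """Downsample in time, normalize, flatten to 255-dim vectors, pad or trim."""
    features = []
    for frame in frames[::stride]:  # assumed temporal downsampling
        if len(frame) != NUM_LANDMARKS:
            raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(frame)}")
        features.append([v for p in normalize_frame(frame) for v in p])
    features = features[:MAX_FRAMES]  # trim clips longer than the window
    dim = NUM_LANDMARKS * 3
    # Zero-pad short clips up to the window length (assumed padding scheme).
    features += [[0.0] * dim for _ in range(MAX_FRAMES - len(features))]
    return features


# Synthetic stand-in for MediaPipe output: 100 frames of 85 random landmarks.
random.seed(0)
clip = [[[random.random() for _ in range(3)] for _ in range(NUM_LANDMARKS)]
        for _ in range(100)]
feats = frames_to_features(clip)
print(len(feats), len(feats[0]))  # 256 255
```

In a real pipeline the synthetic `clip` would be replaced by landmarks extracted from video, and the resulting 256 × 255 sequence would be embedded before being passed to the encoder.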
Experimental results
Research questions
- RQ1: Can a large-scale, open-domain ASL-English corpus mined from YouTube improve ASL-to-English translation benchmarks?
- RQ2: What is the impact of pretraining on English text and of mixing YouTube-ASL data with How2Sign data on translation quality?
- RQ3: How does zero-shot performance on How2Sign compare to finetuned performance when using YouTube-ASL data?
- RQ4: Does the YouTube-ASL dataset provide improvements over prior ASL datasets in terms of size and signer diversity?
Key findings
- YouTube-ASL comprises 11,093 ASL videos totaling ~984 hours, with 610,193 English captions covering 813 hours, from 2519+ channels, which serve as a proxy for unique signers.
- Finetuned state-of-the-art on How2Sign: 12.39 BLEU, surpassing prior SOTA of 8.03 BLEU.
- Zero-shot BLEU of 3.95 demonstrates nontrivial out-of-domain translation capability.
- Training on YouTube-ASL alone yields lower scores; pretraining on English text and finetuning on How2Sign both substantially boost performance.
- Training on mixed YouTube-ASL and How2Sign data reaches 36.35 BLEU-1, 23.00 BLEU-2, 16.13 BLEU-3, and 11.89 BLEU; training on YouTube-ASL and then finetuning on How2Sign gives the best overall result of 12.39 BLEU.
- YouTube-ASL provides substantial signer variety and real-world domain coverage, though translation quality is still short of what deployment would require.
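To put the figures above in perspective, the following short calculation derives a few ratios from them. These derived values are not reported in the review; they are simple arithmetic on the quoted numbers, and the mean caption duration assumes the 813 hours is the summed duration of all captions. The variable names are chosen for this example only.

```python
# Figures quoted in the key findings above.
total_hours = 984        # approximate total video duration
caption_hours = 813      # duration covered by captions
num_videos = 11_093
num_captions = 610_193
prior_sota_bleu = 8.03
finetuned_bleu = 12.39
zero_shot_bleu = 3.95

# Derived quantities (illustrative arithmetic, not reported results).
avg_caption_seconds = caption_hours * 3600 / num_captions
avg_video_minutes = total_hours * 60 / num_videos
captioned_fraction = caption_hours / total_hours
relative_gain = (finetuned_bleu - prior_sota_bleu) / prior_sota_bleu
zero_shot_ratio = zero_shot_bleu / finetuned_bleu

print(f"mean caption duration: {avg_caption_seconds:.1f} s")
print(f"mean video length:     {avg_video_minutes:.1f} min")
print(f"captioned fraction:    {captioned_fraction:.1%}")
print(f"relative BLEU gain:    {relative_gain:.1%}")
print(f"zero-shot / finetuned: {zero_shot_ratio:.1%}")
```

Under these assumptions the average caption lasts a little under 5 seconds, and the finetuned model improves on the prior state of the art by roughly 54% in relative BLEU, while the zero-shot model reaches about a third of the finetuned score.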
This review was created by AI and reviewed by human editors.