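[Paper Digest] ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging
ARCADE is a city-level, fine-grained Arabic dialect corpus collected from radio broadcast streams, with rich metadata and manual annotations covering emotion, speech type, dialect category, and quality. It supports fine-grained dialect attribution and multi-task analysis, and covers 58 cities across 19 countries.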
The Arabic language is characterized by a rich tapestry of regional dialects that differ substantially in phonetics and lexicon, reflecting the geographic and cultural diversity of its speakers. Despite the availability of many multi-dialect datasets, mapping speech to fine-grained dialect sources, such as cities, remains underexplored. We present ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world. Our data pipeline captures 30-second segments from verified radio streams, encompassing both Modern Standard Arabic (MSA) and diverse dialectal speech. To ensure reliability, each clip was annotated by one to three native Arabic reviewers who assigned rich metadata, including emotion, speech type, dialect category, and a validity flag for dialect identification tasks. The resulting corpus comprises 6,907 annotations and 3,790 unique audio segments spanning 58 cities across 19 countries. These fine-grained annotations enable robust multi-task learning, serving as a benchmark for city-level dialect tagging. We detail the data collection methodology, assess audio quality, and provide a comprehensive analysis of label distributions. The dataset is available on: https://huggingface.co/datasets/riotu-lab/ARCADE-full
Research Motivation and Objectives
- Motivate fine-grained, city-level dialect labeling beyond country or region-level classifications.
- Describe a data collection pipeline that harvests radio speech from Arab city streams.
- Provide rich, manually verified annotations (emotion, speech type, dialect category, quality) for robust multi-task learning.
- Analyze label distributions, audio quality, and geographic coverage to inform modeling decisions.
- Offer a reusable protocol and open data to catalyze city-level dialect attribution research.
Proposed Method
- Design and implement a radio-stream recording pipeline that collects 30-second monologue segments from Arab city streams.
- Annotate each clip with emotion, speech type, dialect category, keep/skip decision, and annotator confidence using a custom Gradio-based interface.
- Ensure geographic granularity by recording from 58 cities across 19 countries and enforcing a minimum of 10 kept recordings per city.
- Provide metadata fields including country, city, MSA/dialect, annotator, timestamp, and duration for reproducibility.
- Perform technical validation through inter-annotator agreement and acoustic quality metrics (SNR, silence ratio, dynamic range, spectral centroid).
- Make the full dataset available on Hugging Face Datasets under CC BY 4.0 for non-commercial academic use.
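The four acoustic quality metrics named in the validation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source does not give the exact definitions, so the frame length, the -40 dB silence threshold, the percentile-based SNR estimate, and the function names `frame_rms` and `acoustic_quality` are all assumptions chosen for the example. The input is assumed to be a mono signal scaled to [-1, 1].

```python
import numpy as np


def frame_rms(signal, frame_len):
    """Return the RMS level of each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))


def acoustic_quality(signal, sample_rate, frame_ms=25, silence_db=-40.0):
    """Compute four illustrative quality metrics for a mono clip.

    Every definition below is an assumption made for this sketch.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)
    eps = 1e-12  # avoids log(0) and division by zero on digital silence

    rms = frame_rms(signal, frame_len)
    rms_db = 20.0 * np.log10(rms + eps)  # frame level in dB re full scale

    # Silence ratio: share of frames quieter than the assumed threshold.
    silence_ratio = float(np.mean(rms_db < silence_db))

    # SNR estimate: use the 10th-percentile frame level as the noise floor
    # and the 90th-percentile level as the speech level (a heuristic).
    noise_rms = np.percentile(rms, 10) + eps
    speech_rms = np.percentile(rms, 90) + eps
    snr_db = float(20.0 * np.log10(speech_rms / noise_rms))

    # Dynamic range: spread between loud and quiet frame levels in dB.
    dynamic_range_db = float(np.percentile(rms_db, 95) - np.percentile(rms_db, 5))

    # Spectral centroid: magnitude-weighted mean frequency of the clip.
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    centroid_hz = float(np.sum(freqs * magnitude) / (np.sum(magnitude) + eps))

    return {
        "snr_db": snr_db,
        "silence_ratio": silence_ratio,
        "dynamic_range_db": dynamic_range_db,
        "spectral_centroid_hz": centroid_hz,
    }
```

The percentile-based SNR only makes sense for clips that contain pauses, as broadcast speech usually does; for a stationary signal every frame has a similar level and the estimate collapses toward 0 dB.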
Experimental Results
Research Questions
- RQ1: Can city-level dialect labeling be realized reliably from radio-sourced Arabic speech?
- RQ2: What are the distributions of dialect vs. MSA, emotion, and speech type across fine-grained cities in the Arab world?
- RQ3: How does audio quality vary geographically, and what implications does it have for dialect identification models?
- RQ4: How reliable are manual annotations for keep/skip decisions, dialect classification, and emotion in a radio speech corpus?
- RQ5: Can ARCADE enable multi-task learning that jointly models dialect category, emotion, and sub-regional origin?
Key Findings
- The dataset comprises 6,907 annotations and 3,790 unique audio clips from 58 cities in 19 countries.
- 65.7% of clips were kept for dialect identification after manual review; 34.3% were skipped due to Quran recitation, music, or crosstalk.
- Dialects account for 41% of clips, MSA 21%, mixed 18%, and not applicable 20%.
- Emotion annotations are dominated by Neutral at 87.8%, with other emotions underrepresented.
- Inter-annotator agreement (raw agreement / Cohen’s kappa): Keep/Skip 91.76% / 0.507; MSA/Dialect 83.16% / 0.310; Emotion 90.53% / 0.179; Type 87.71% / 0.586.
- Retained samples exhibit higher audio quality (mean SNR 15.25 dB) than skipped samples (9.36 dB).
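The gap between high raw agreement and low kappa reported above, most visible for Emotion (90.53% raw agreement, kappa 0.179), is what one would expect when a single label dominates, as Neutral does at 87.8%. Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes both quantities for two annotators. The label lists are synthetic and chosen only to illustrate the effect; they are not drawn from ARCADE, and the function names are hypothetical.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = raw_agreement(labels_a, labels_b)

    # Chance agreement: product of each annotator's marginal label
    # frequencies, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined, and returning 1.0 is a choice made here.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Synthetic example (NOT ARCADE data): 100 clips, "Neutral" heavily dominant.
annotator_a = ["Neutral"] * 89 + ["Happy"] * 1 + ["Neutral"] * 5 + ["Happy"] * 5
annotator_b = ["Neutral"] * 89 + ["Happy"] * 1 + ["Happy"] * 5 + ["Neutral"] * 5

print(f"raw agreement: {raw_agreement(annotator_a, annotator_b):.2%}")
print(f"Cohen's kappa: {cohens_kappa(annotator_a, annotator_b):.3f}")
```

In this synthetic case the annotators agree on 90 of 100 clips, yet chance agreement is already 0.8872 because both label 94% of clips Neutral, so kappa comes out at roughly 0.11. The same mechanism plausibly explains why the Emotion kappa is far lower than its raw agreement suggests.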
This digest was generated by AI and reviewed by human editors.