[Paper Explained] 600K-KS-OCR: A Large-Scale Synthetic Dataset for Optical Character Recognition in Kashmiri Script
Introduces 600K-KS-OCR, a large-scale synthetic Kashmiri OCR dataset with ~602k word images, right-to-left (RTL) Kashmiri text, and ground truth in multiple formats for CRNN, TrOCR, and general-purpose ML pipelines.
This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language spoken by approximately seven million people and written in a modified Perso-Arabic script. Each image is rendered at 256x64 pixels, with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and general-purpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.
Research Motivation and Objectives
- Provide a large-scale synthetic dataset to advance Kashmiri OCR under low-resource conditions.
- Capture the script’s calligraphic diversity using multiple traditional Kashmiri typefaces.
- Improve robustness through extensive data augmentation and diverse backgrounds to simulate real-world documents.
- Offer accessible data formats compatible with common OCR training frameworks to enable reproducible research.
Proposed Method
- Render ~602k word-level images at 256x64 using three Kashmiri typefaces (Afan Koshur Naksh, Nastaleeq, Nakash).
- Apply a comprehensive augmentation pipeline (geometric, blur, noise, photometric, document-specific) to 60% of samples.
- Synthesize mixed backgrounds spanning clean to aged textures to simulate real documents.
- Distribute the data as ten partitioned archives, with labels in CRNN, TrOCR, CSV, and JSONL formats for flexible integration.
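The 60/40 split between augmented and clean samples can be illustrated with a minimal sketch. It is not the authors' generator: the function name `plan_augmentation`, the seed, and the rule for how many augmentation families are combined per sample are assumptions made for illustration. Only the 60% rate and the five family names come from the summary above.

```python
import random

# From the summary: 60% of samples are augmented, 40% stay clean.
AUGMENT_PROB = 0.6

# The five augmentation families named in the summary.
FAMILIES = ["geometric", "blur", "noise", "photometric", "document"]


def plan_augmentation(n_samples, seed=0):
    """Return, for each sample, the list of augmentation families to apply.

    An empty list means the sample is left clean. How many families are
    combined per sample is an assumption; the source does not say.
    """
    rng = random.Random(seed)
    plans = []
    for _ in range(n_samples):
        if rng.random() < AUGMENT_PROB:
            k = rng.randint(1, len(FAMILIES))  # hypothetical choice of count
            plans.append(sorted(rng.sample(FAMILIES, k)))
        else:
            plans.append([])
    return plans


plans = plan_augmentation(10_000)
augmented_fraction = sum(1 for p in plans if p) / len(plans)
print(f"augmented fraction: {augmented_fraction:.3f}")
```

Deciding per sample with a fixed probability keeps the clean and degraded subsets mixed across all partitions, so a model sees both kinds of image in every training batch.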
Experimental Results
Research Questions
- RQ1: How effective are large-scale synthetic Kashmiri word images for training OCR models (CRNN and Transformer-based) on Kashmiri script?
- RQ2: Do multiple Kashmiri typefaces and diverse backgrounds improve generalization to real-world Kashmiri documents?
- RQ3: What is the impact of structured data augmentation on OCR robustness for Kashmiri script?
- RQ4: Can the dataset formats facilitate efficient fine-tuning and benchmarking across OSS OCR pipelines?
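The summary does not say which metric would be used to answer these questions. One common choice for word-level OCR is character error rate (CER), the edit distance between prediction and reference divided by the reference length. The sketch below assumes that metric and counts Unicode code points, so a base letter and its combining diacritic count as two characters. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a prediction `hyp` against reference `ref`."""
    if not ref:
        # Convention chosen for this sketch: empty reference scores 0 only
        # when the prediction is also empty.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))  # one substitution over four characters
```

Because Python strings hold text in logical order, the same code applies unchanged to RTL Kashmiri transcriptions.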
Key Findings
- Approximately 602,000 word images are provided across ten archives totaling ~10.6 GB.
- Images are 256x64 PNG with RTL Kashmiri text and ground-truth in CRNN, TrOCR, CSV, and JSONL formats.
- Augmentation is applied to 60% of samples to simulate realistic document degradation; 40% remain clean.
- The dataset is CC-BY-4.0 licensed and accessible via the Hugging Face Datasets hub.
- The data includes metadata detailing fonts used and generation settings.
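To show how CSV and JSONL ground truth can carry RTL text, here is a minimal round-trip sketch using only the Python standard library. The field names `image` and `text`, the file names, and the sample words are assumptions for illustration. The dataset's actual schema is not given in this summary and should be checked against its documentation.

```python
import csv
import io
import json

# Hypothetical records: field names, file names, and words are illustrative
# and are not taken from the dataset.
records = [
    {"image": "word_000001.png", "text": "کٲشُر"},
    {"image": "word_000002.png", "text": "کشمیر"},
]


def to_csv(rows):
    """Serialize records to CSV text with a header row."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=["image", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(data):
    """Parse CSV text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(data, newline=""))]


def to_jsonl(rows):
    """Serialize records to JSONL, keeping Kashmiri characters unescaped."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in rows)


def from_jsonl(data):
    """Parse JSONL text back into a list of dicts."""
    return [json.loads(line) for line in data.split("\n") if line.strip()]


assert from_csv(to_csv(records)) == records
assert from_jsonl(to_jsonl(records)) == records
```

Transcriptions are stored in logical (reading) order, and right-to-left display is handled by the renderer, so no string reversal is needed when reading or writing labels. When writing to disk, opening files with `encoding="utf-8"` avoids platform-dependent defaults.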
This explainer was generated by AI and reviewed by human editors.