QUICK REVIEW

[Paper Review] 600K-KS-OCR: A Large-Scale Synthetic Dataset for Optical Character Recognition in Kashmiri Script

Haq Nawaz Malik|arXiv (Cornell University)|Jan 3, 2026
Handwritten Text Recognition Techniques | Cited by 0
One-Sentence Summary

Introduces 600K-KS-OCR, a large-scale synthetic Kashmiri OCR dataset of ~602k word images with right-to-left (RTL) Kashmiri text and ground truth in multiple formats for CRNN, TrOCR, and general-purpose ML pipelines.

ABSTRACT

This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language spoken by approximately seven million people and written in a modified Perso-Arabic script. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and general-purpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.

Research Motivation and Objectives

  • Provide a large-scale synthetic dataset to advance Kashmiri OCR under low-resource conditions.
  • Capture the script’s calligraphic diversity using multiple traditional Kashmiri typefaces.
  • Improve robustness through extensive data augmentation and diverse backgrounds to simulate real-world documents.
  • Offer accessible data formats compatible with common OCR training frameworks to enable reproducible research.

Proposed Method

  • Render ~602k word-level images at 256x64 using three Kashmiri typefaces (Afan Koshur Naksh, Nastaleeq, Nakash).
  • Apply a comprehensive augmentation pipeline (geometric, blur, noise, photometric, document-specific) to 60% of samples.
  • Synthesize mixed backgrounds spanning clean to aged textures to simulate real documents.
  • Distribute the data as ten archive partitions, with ground truth in CRNN, TrOCR, CSV, and JSONL formats for flexible integration.
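
The 60/40 split between augmented and clean samples can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the summary does not say whether the 60% is an exact quota or a per-sample probability, nor whether a sample receives one augmentation family or several. The sketch assumes an exact quota and one family per augmented sample, and the names `plan_augmentation`, `AUGMENTATION_FAMILIES`, and `AUGMENT_FRACTION` are hypothetical.

```python
import random

# Family names come from the summary; how they are chosen is an assumption.
AUGMENTATION_FAMILIES = ["geometric", "blur", "noise", "photometric", "document-specific"]
AUGMENT_FRACTION = 0.6  # share of samples that receive augmentation, per the summary


def plan_augmentation(num_samples, seed=0):
    """Return one entry per sample: None for clean, else an augmentation family name."""
    rng = random.Random(seed)  # seeded so the plan is reproducible
    indices = list(range(num_samples))
    rng.shuffle(indices)
    num_augmented = round(num_samples * AUGMENT_FRACTION)
    plan = [None] * num_samples
    # Assumption: each augmented sample gets exactly one family, chosen uniformly.
    for idx in indices[:num_augmented]:
        plan[idx] = rng.choice(AUGMENTATION_FAMILIES)
    return plan


if __name__ == "__main__":
    plan = plan_augmentation(10)
    augmented = sum(1 for entry in plan if entry is not None)
    print(f"{augmented} of {len(plan)} samples marked for augmentation")
```

Fixing the seed keeps the clean/augmented assignment stable across runs, which helps when comparing models trained on the same generated data.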

Experimental Results

Research Questions

  • RQ1: How effective are large-scale synthetic Kashmiri word images for training OCR models (CRNN and Transformer-based) on Kashmiri script?
  • RQ2: Do multiple Kashmiri typefaces and diverse backgrounds improve generalization to real-world Kashmiri documents?
  • RQ3: What is the impact of structured data augmentation on OCR robustness for Kashmiri script?
  • RQ4: Can the dataset formats facilitate efficient fine-tuning and benchmarking across open-source OCR pipelines?

Key Findings

  • Approximately 602,000 word images are provided across ten archives totaling ~10.6 GB.
  • Images are 256x64 PNG files containing RTL Kashmiri text, with ground truth in CRNN, TrOCR, CSV, and JSONL formats.
  • Augmentation is applied to 60% of samples to simulate realistic document degradation; the remaining 40% are left clean.
  • The dataset is licensed under CC-BY-4.0 and available on the Hugging Face Datasets hub.
  • The data includes metadata detailing fonts used and generation settings.
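
To illustrate what multi-format ground truth can look like, the sketch below writes the same (image, transcription) pairs as tab-separated label lines, CSV, and JSONL. The summary does not give the dataset's actual schemas, so the field names `image` and `text`, the file paths, and the tab-separated layout are all assumptions made for illustration; the TrOCR layout is not shown because it is not described.

```python
import csv
import io
import json


def to_tab_lines(samples):
    """One 'image_path<TAB>text' line per sample (assumed layout, common in CRNN training scripts)."""
    return "\n".join(f"{path}\t{text}" for path, text in samples)


def to_csv(samples):
    """CSV with a header row; the column names are hypothetical."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image", "text"])
    writer.writerows(samples)
    return buffer.getvalue()


def to_jsonl(samples):
    """One JSON object per line; ensure_ascii=False keeps the script readable in the file."""
    return "\n".join(
        json.dumps({"image": path, "text": text}, ensure_ascii=False)
        for path, text in samples
    )


if __name__ == "__main__":
    # Hypothetical paths; transcriptions are stored in logical (Unicode) order,
    # and right-to-left layout is applied only when the text is rendered.
    samples = [("images/000001.png", "کٲشُر"), ("images/000002.png", "کتاب")]
    print(to_tab_lines(samples))
    print(to_csv(samples), end="")
    print(to_jsonl(samples))
```

Keeping every format derived from one list of pairs means the transcriptions cannot drift apart between the label files.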

This review was generated by AI and reviewed by a human editor.