QUICK REVIEW

[Paper Review] 600K-KS-OCR: A Large-Scale Synthetic Dataset for Optical Character Recognition in Kashmiri Script

Haq Nawaz Malik|arXiv (Cornell University)|Jan 3, 2026

Handwritten Text Recognition Techniques | Cited by 0

One-Sentence Summary

Introduces 600K-KS-OCR, a large-scale synthetic Kashmiri OCR dataset of roughly 602,000 word images with right-to-left (RTL) Kashmiri text and ground truth in multiple formats for CRNN, TrOCR, and general-purpose ML pipelines.

ABSTRACT

This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language utilizing a modified Perso-Arabic writing system spoken by approximately seven million people. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and general-purpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.
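
The figures quoted in the abstract allow a quick back-of-the-envelope check of what each archive holds. The sketch below is only illustrative: it assumes the ten archives are equally sized (the abstract does not say so) and uses decimal units (1 GB = 1,000,000 KB). All function and constant names are hypothetical.

```python
# Approximate figures quoted in the abstract.
TOTAL_IMAGES = 602_000
TOTAL_SIZE_GB = 10.6
NUM_ARCHIVES = 10
WIDTH, HEIGHT = 256, 64


def per_archive_stats(total_images, total_size_gb, num_archives):
    """Return (images, gigabytes) per archive, ASSUMING an even split."""
    return total_images / num_archives, total_size_gb / num_archives


def mean_kb_per_image(total_images, total_size_gb):
    """Average on-disk size per image in KB (decimal units assumed)."""
    return total_size_gb * 1_000_000 / total_images


if __name__ == "__main__":
    images, gigabytes = per_archive_stats(TOTAL_IMAGES, TOTAL_SIZE_GB, NUM_ARCHIVES)
    print(f"images per archive: {images:.0f}")
    print(f"GB per archive:     {gigabytes:.2f}")
    print(f"mean KB per image:  {mean_kb_per_image(TOTAL_IMAGES, TOTAL_SIZE_GB):.1f}")
    print(f"pixels per image:   {WIDTH * HEIGHT}")
```

Under these assumptions each archive would hold about 60,200 images in roughly 1.06 GB, which averages to a little under 18 KB per 256x64 image.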

Research Motivation and Objectives

Provide a large-scale synthetic dataset to advance Kashmiri OCR under low-resource conditions.
Capture the script’s calligraphic diversity using multiple traditional Kashmiri typefaces.
Improve robustness through extensive data augmentation and diverse backgrounds to simulate real-world documents.
Offer accessible data formats compatible with common OCR training frameworks to enable reproducible research.

Proposed Method

Render ~602k word-level images at 256x64 using three Kashmiri typefaces (Afan Koshur Naksh, Nastaleeq, Nakash).
Apply a comprehensive augmentation pipeline (geometric, blur, noise, photometric, document-specific) to 60% of samples.
Synthesize mixed backgrounds spanning clean to aged textures to simulate real documents.
Distribute the data as ten archive partitions, with ground truth in CRNN, TrOCR, CSV, and JSONL formats for flexible integration.
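
The 60/40 split between augmented and clean samples described above can be sketched as follows. This is a minimal illustration, not the report's generation code: it assumes each sample is independently selected for augmentation with probability 0.6 (the report may instead augment an exact 60% subset), represents a grayscale image as nested lists of 0-255 integers, and stands in for the full pipeline with only two hypothetical transforms (brightness shift and Gaussian noise).

```python
import random

AUGMENT_PROBABILITY = 0.6  # from the summary: 60% of samples are augmented
WIDTH, HEIGHT = 256, 64    # image size stated in the summary


def clamp(value):
    """Clamp a pixel value to the valid 8-bit range."""
    return min(255, max(0, value))


def shift_brightness(image, rng, max_delta=30):
    """Photometric stand-in: add one random offset to every pixel."""
    delta = rng.randint(-max_delta, max_delta)
    return [[clamp(p + delta) for p in row] for row in image]


def add_noise(image, rng, sigma=10.0):
    """Noise stand-in: add independent Gaussian noise to each pixel."""
    return [[clamp(round(p + rng.gauss(0.0, sigma))) for p in row] for row in image]


def maybe_augment(image, rng):
    """Return (image, was_augmented); clean samples pass through unchanged."""
    if rng.random() >= AUGMENT_PROBABILITY:
        return image, False
    return add_noise(shift_brightness(image, rng), rng), True


def blank_image(value=255):
    """A plain white canvas at the dataset's stated resolution."""
    return [[value] * WIDTH for _ in range(HEIGHT)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    flags = [maybe_augment(blank_image(), rng)[1] for _ in range(100)]
    print(f"augmented fraction: {sum(flags) / len(flags):.2f}")
```

Passing an explicit `random.Random` instance keeps the selection reproducible, which matters when the same partition has to be regenerated.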

Experimental Results

Research Questions

RQ1: How effective are large-scale synthetic Kashmiri word images for training OCR models (CRNN and Transformer-based) on Kashmiri script?
RQ2: Do multiple Kashmiri typefaces and diverse backgrounds improve generalization to real-world Kashmiri documents?
RQ3: What is the impact of structured data augmentation on OCR robustness for Kashmiri script?
RQ4: Can the dataset formats facilitate efficient fine-tuning and benchmarking across open-source OCR pipelines?

Key Findings

Approximately 602,000 word images are provided across ten archives totaling ~10.6 GB.
Images are 256x64 PNG with RTL Kashmiri text and ground-truth in CRNN, TrOCR, CSV, and JSONL formats.
Augmentation is applied to 60% of samples to simulate realistic document degradation; 40% remain clean.
The dataset is released under the CC-BY-4.0 license and is available on the Hugging Face Datasets hub.
The data includes metadata detailing fonts used and generation settings.
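
To make the CSV and JSONL label formats concrete, here is a small sketch of how image/transcription pairs might be serialized and read back. The field names (`file_name`, `text`), the file names, and the sample words are assumptions for illustration only; the dataset's actual schema should be checked against its own documentation. Unicode strings store RTL text in logical order, so no reversal is needed when reading or writing labels.

```python
import csv
import io
import json

# Hypothetical sample records: (image file name, Kashmiri transcription).
SAMPLE_RECORDS = [
    ("word_000001.png", "کٲشُر"),
    ("word_000002.png", "کشمیر"),
]


def to_jsonl(records):
    """Serialize records as one JSON object per line, keeping non-ASCII text readable."""
    return "\n".join(
        json.dumps({"file_name": name, "text": text}, ensure_ascii=False)
        for name, text in records
    )


def from_jsonl(payload):
    """Parse JSONL text back into (file_name, text) tuples, skipping blank lines."""
    records = []
    for line in payload.split("\n"):
        if not line.strip():
            continue
        obj = json.loads(line)
        records.append((obj["file_name"], obj["text"]))
    return records


def to_csv(records):
    """Serialize records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file_name", "text"])
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(payload):
    """Parse CSV text (with header) back into (file_name, text) tuples."""
    reader = csv.DictReader(io.StringIO(payload))
    return [(row["file_name"], row["text"]) for row in reader]


if __name__ == "__main__":
    print(to_jsonl(SAMPLE_RECORDS))
    print(to_csv(SAMPLE_RECORDS))
```

Setting `ensure_ascii=False` keeps the Perso-Arabic characters human-readable in the JSONL output instead of emitting `\u` escapes; both forms decode to the same strings.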


This review was generated by AI and reviewed by human editors.