QUICK REVIEW

[Paper Review] COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

Andreas Veit, Tomáš Matera|arXiv (Cornell University)|Jan 26, 2016

Handwritten Text Recognition Techniques | Cited by 237

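One-Sentence Summary

COCO-Text introduces a large-scale, richly annotated dataset for detecting and recognizing text in natural images. Its annotations go beyond transcriptions to include legibility, text class (machine printed or handwritten), and script, and the paper evaluates state-of-the-art photo OCR methods on this data.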

ABSTRACT

This paper describes the COCO-Text dataset. In recent years large-scale datasets like SUN and Imagenet drove the advancement of scene understanding and object recognition. The goal of COCO-Text is to advance state-of-the-art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of complex everyday scenes. The images were not collected with text in mind and thus contain a broad variety of text instances. To reflect the diversity of text in natural scenes, we annotate text with (a) location in terms of a bounding box, (b) fine-grained classification into machine printed text and handwritten text, (c) classification into legible and illegible text, (d) script of the text and (e) transcriptions of legible text. The dataset contains over 173k text annotations in over 63k images. We provide a statistical analysis of the accuracy of our annotations. In addition, we present an analysis of three leading state-of-the-art photo Optical Character Recognition (OCR) approaches on our dataset. While scene text detection and recognition enjoys strong advances in recent years, we identify significant shortcomings motivating future work.
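The five annotation types (a) to (e) listed in the abstract can be pictured as one record per text instance. The following is a minimal sketch only: the class name, field names, string values, and the (x, y, width, height) box layout are assumptions made for illustration, not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TextAnnotation:
    """One text instance. All names and value spellings are hypothetical."""
    image_id: int
    bbox: Tuple[float, float, float, float]  # (a) location; layout assumed to be (x, y, width, height)
    legible: bool                            # (c) legible vs. illegible
    text_class: str                          # (b) e.g. "machine printed" or "handwritten"
    script: str                              # (d) e.g. "english" or "not english"
    transcription: Optional[str] = None      # (e) collected for legible text only

    def __post_init__(self) -> None:
        # The abstract states that transcriptions exist for legible text,
        # so an illegible instance carrying one is treated as an error here.
        if not self.legible and self.transcription is not None:
            raise ValueError("illegible text should not carry a transcription")


def legible_fraction(annotations):
    """Share of annotations marked legible (0.0 for an empty list)."""
    if not annotations:
        return 0.0
    return sum(a.legible for a in annotations) / len(annotations)
```

Keeping the transcription optional mirrors the fact that only legible text is transcribed; a helper such as `legible_fraction` is the kind of per-attribute statistic the dataset analysis reports.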

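Motivation and Goals

Provide a large-scale, diverse dataset of text in natural scenes to advance scene text detection and recognition.
Annotate text instances with bounding boxes and fine-grained attributes (legibility, machine printed vs. handwritten, script).
Evaluate current state-of-the-art photo OCR methods on the dataset and identify the gaps that remain for real-world applications.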

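Proposed Method

Annotate text regions in MS COCO images using a multi-stage crowdsourcing pipeline.
Combine the output of several photo OCR systems with the work of human annotators to detect and refine text regions.
Classify text regions by legibility, script, and class (machine printed, handwritten, or other).
Collect transcriptions for legible text; text found to be illegible during the transcription iterations is flagged as such.
Evaluate detection, transcription, and end-to-end performance on a held-out validation set using ICDAR-style metrics.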
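As an illustration of what an ICDAR-style detection metric typically involves, the sketch below matches detected boxes to ground-truth boxes by intersection over union (IoU) and derives precision, recall, and F-score. It is a simplified sketch under stated assumptions, not the paper's evaluation code: the function names, the (x, y, width, height) box layout, the greedy one-to-one matching, and the 0.5 threshold are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def detection_scores(ground_truth, detections, threshold=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f_score).

    The threshold of 0.5 is an assumed, commonly used value.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for det in detections:
        # Pick the still-unmatched ground-truth box that overlaps most.
        best = max(unmatched, key=lambda gt: iou(gt, det), default=None)
        if best is not None and iou(best, det) > threshold:
            unmatched.remove(best)  # each ground-truth box matches at most once
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_score
```

For example, one accurate detection against two ground-truth boxes gives a precision of 1.0 and a recall of 0.5, which is the "high precision, low recall" pattern in miniature.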

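Experimental Results

Research Questions

RQ1: How diverse is text in natural scenes when it is annotated on the large-scale MS COCO dataset?
RQ2: Can crowd workers and OCR systems reliably detect and classify text in natural images across a wide range of text types and legibility levels?
RQ3: What are the current limitations of state-of-the-art photo OCR methods on unconstrained scene text, particularly with respect to illegible text and detection recall?
RQ4: How does context (the objects in COCO) relate to the presence of text in natural images?
RQ5: What improvements are needed to achieve robust end-to-end text recognition in real-world scenarios?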

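Key Findings

COCO-Text contains 63,686 images with 173,589 text annotations, covering bounding boxes and fine-grained attributes.
About 50% of COCO-Text images contain no text; there are on average 2.73 text instances per image (5.46 among images that contain text).
Text attributes include legibility (60.3% legible, 39.7% illegible), class (machine printed vs. handwritten), and script (English vs. non-English).
The three leading photo OCR systems achieve high detection precision but low recall, especially on illegible text, leaving a large gap still to be closed.
Crowd annotators detected 57% of all text regions: 84% of legible text and 39% of illegible text.
End-to-end recognition results are limited to legible machine-printed and handwritten English text, underscoring the gap between the richness of the dataset and current OCR capabilities.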
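The per-image averages quoted above can be cross-checked from the two dataset totals. The short calculation below uses only figures stated in this review; the variable names are illustrative.

```python
NUM_IMAGES = 63_686          # images in COCO-Text, as stated above
NUM_ANNOTATIONS = 173_589    # text annotations, as stated above
AVG_PER_TEXT_IMAGE = 5.46    # average among images that contain text

# Average over all images, with or without text.
avg_per_image = NUM_ANNOTATIONS / NUM_IMAGES

# If images with text average 5.46 instances, the share of images that
# contain text is the ratio of the two averages.
implied_text_image_share = avg_per_image / AVG_PER_TEXT_IMAGE

print(f"{avg_per_image:.2f}")             # 2.73
print(f"{implied_text_image_share:.2f}")  # 0.50
```

The two averages are therefore consistent with roughly half of the images containing no text.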


This review was generated by AI and checked by human editors.