QUICK REVIEW

[Paper Digest] Text Detection and Recognition in the Wild: A Review

Zobeir Raisi, Mohamed A. Naiel|arXiv (Cornell University)|Jun 8, 2020

Handwritten Text Recognition Techniques | References: 181 | Cited by: 26

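Summary in Brief

This paper surveys recent deep-learning-based methods for scene text detection and recognition in natural, unconstrained environments (text "in the wild"). It evaluates pre-trained state-of-the-art (SOTA) models on challenging benchmark datasets under a unified framework, revealing key performance gaps under real-world distortions. The study finds that hybrid detection models (e.g., PMTD) and attention-based recognition networks (e.g., ASTER, CLOVA) are more robust, while occlusion, complex fonts, and special characters remain persistent challenges.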

ABSTRACT

Detection and recognition of text in natural images are two main problems in the field of computer vision that have a wide variety of applications in analysis of sports videos, autonomous driving, and industrial automation, to name a few. They face common challenging problems that are factors in how text is represented and affected by several environmental conditions. The current state-of-the-art scene text detection and/or recognition methods have exploited the witnessed advancement in deep learning architectures and reported a superior accuracy on benchmark datasets when tackling multi-resolution and multi-oriented text. However, there are still several remaining challenges affecting text in the wild images that cause existing methods to underperform, because their models are not able to generalize to unseen data and because labeled data are insufficient. Thus, unlike previous surveys in this field, the objectives of this survey are as follows: first, offering the reader not only a review on the recent advancement in scene text detection and recognition, but also presenting the results of conducting extensive experiments using a unified evaluation framework that assesses pre-trained models of the selected methods on challenging cases, and applies the same evaluation criteria on these techniques. Second, identifying several existing challenges for detecting or recognizing text in the wild images, namely, in-plane-rotation, multi-oriented and multi-resolution text, perspective distortion, illumination reflection, partial occlusion, complex fonts, and special characters. Finally, the paper also presents insight into the potential research directions in this field to address some of the mentioned challenges that are still confronting scene text detection and recognition techniques.

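Motivation and Objectives

Provide a detailed review of recent advances in deep-learning-based scene text detection and recognition.
Evaluate the performance of state-of-the-art pre-trained models on multiple benchmark datasets using a unified experimental framework.
Identify persistent challenges in detecting and recognizing text under real-world conditions, such as occlusion, perspective distortion, and complex fonts.
Propose future research directions to address the generalization gap and data scarcity in text-in-the-wild applications.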

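Proposed Method

Conducts extensive experiments under a unified evaluation framework on the ICDAR13, ICDAR15, and COCO-Text datasets to compare pre-trained models.
Evaluates detection and recognition models using consistent ground-truth annotations and evaluation metrics across all datasets.
Categorizes detection methods into segmentation-based methods (e.g., PixelLink, PSENET, PAN), hybrid regression-segmentation methods (e.g., PMTD), and character-level detection methods (e.g., CRAFT).
Evaluates recognition models by architecture type: CTC-based models (e.g., CRNN, STARNET, ROSETTA) versus attention-based models (e.g., ASTER, CLOVA, Baek2019STR).
Analyzes model performance in multi-challenge scenarios, including in-plane rotation, multi-oriented text, and partial occlusion.
Proposes integrating BERT-style language models and style-transfer techniques to improve robustness to occlusion and complex fonts.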
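
To make the idea of applying one set of evaluation criteria to every detector more concrete, here is a minimal sketch of how precision, recall, and H-mean (the harmonic mean of the two) can be computed from predicted and ground-truth boxes. It is an illustration under stated assumptions, not the paper's evaluation code: boxes are simplified to axis-aligned rectangles (the ICDAR15 and COCO-Text protocols work with quadrilaterals or polygons), matching is greedy one-to-one, and the IoU threshold of 0.5 is a common convention assumed here. The names `iou` and `detection_scores` are hypothetical.

```python
from typing import List, Tuple

# Assumption: axis-aligned boxes as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(
    ground_truth: List[Box],
    predictions: List[Box],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the digest
) -> Tuple[float, float, float]:
    """Return (precision, recall, h_mean) using greedy one-to-one matching."""
    matched_gt = set()
    true_positives = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            if idx in matched_gt:
                continue  # each ground-truth box may be matched only once
            overlap = iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched_gt.add(best_idx)
            true_positives += 1

    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    h_mean = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, h_mean


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
    pred = [(0, 0, 10, 10), (50, 50, 60, 60), (21, 21, 30, 30)]
    print(detection_scores(gt, pred))
```

Running the same function with the same threshold over every model's output is what keeps the comparison fair; changing the matching rule or threshold per model would make the H-mean scores incomparable.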

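Experimental Results

Research Questions

RQ1: How do state-of-the-art deep-learning-based scene text detection models perform on diverse real-world benchmarks under a unified evaluation protocol?
RQ2: Which detection and recognition architectures are more robust to multi-oriented, multi-resolution, and distorted text in natural images?
RQ3: What are the key failure modes of current models when facing occlusion, complex fonts, and special characters?
RQ4: How well do recognition models trained only on synthetic data generalize, without fine-tuning, to real-world unconstrained images?
RQ5: What architectural and training improvements are needed to enhance the generalization and robustness of text recognition in the wild?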

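Key Findings

Segmentation-based methods (e.g., PixelLink, PSENET, PAN) are more robust at detecting irregularly shaped text.
Hybrid regression-segmentation models (e.g., PMTD) achieve the highest H-mean scores on ICDAR13, ICDAR15, and COCO-Text, and stand out in multi-oriented text detection.
Character-level detection models (e.g., CRAFT) perform well on irregular and curved text thanks to their fine-grained localization.
When multiple challenges occur together (e.g., occlusion + blur + perspective distortion), the performance of all evaluated methods drops significantly.
Attention-based recognition models (e.g., ASTER, CLOVA) outperform CTC-based models (e.g., CRNN, STARNET) because of their stronger feature extraction and spatial rectification mechanisms.
Recognition models trained only on synthetic data can generalize to real-world images without fine-tuning, indicating strong domain-generalization potential in some cases.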
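
For readers unfamiliar with what "CTC-based" means in the comparison above, the following is a generic sketch of greedy (best-path) CTC decoding: take the highest-scoring class in each frame, merge consecutive repeats, and drop the blank symbol. It is not taken from CRNN, STARNET, or any other evaluated model. The function name `ctc_greedy_decode`, the convention that class index 0 is the blank, and the toy alphabet are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumption: class index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding.

    frame_scores: one score vector per time step, with len(alphabet) + 1
                  entries (index 0 is the blank, index i maps to alphabet[i - 1]).
    """
    # Step 1: pick the most likely class at every time step.
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]

    # Step 2: merge consecutive repeats, then remove blanks.
    decoded: List[str] = []
    previous: Optional[int] = None
    for idx in best_path:
        if idx != previous and idx != BLANK:
            decoded.append(alphabet[idx - 1])
        previous = idx  # a blank resets the repeat check, so "a, blank, a" gives "aa"
    return "".join(decoded)


def one_hot(index: int, size: int) -> List[float]:
    """Helper for building toy score vectors in the example below."""
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    alphabet = "abc"
    path = [1, 1, 0, 1, 2, 2, 0, 3]  # per-frame argmax classes
    frames = [one_hot(i, len(alphabet) + 1) for i in path]
    print(ctc_greedy_decode(frames, alphabet))
```

Because each frame is decoded independently, this style of decoder has no explicit mechanism for attending to a specific image region per character, which is one commonly cited contrast with attention-based recognizers.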


This digest was generated by AI and reviewed by human editors.