QUICK REVIEW

[Paper Review] Deep Structured Output Learning for Unconstrained Text Recognition

Max Jaderberg, Karen Simonyan|arXiv (Cornell University)|Dec 18, 2014

Handwritten Text Recognition Techniques|References: 18|Cited by: 91

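One-Sentence Summary

This paper proposes a deep structured output learning framework that combines convolutional neural networks (CNNs) with a conditional random field (CRF) for unconstrained text recognition. A character predictor and an N-gram predictor are trained jointly by back-propagating a structured loss, and the resulting model achieves state-of-the-art (SOTA) accuracy on both unconstrained and lexicon-constrained benchmarks using only synthetic training data.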

ABSTRACT

We develop a representation suitable for the unconstrained recognition of words in natural images: the general case of no fixed lexicon and unknown length. To this end we propose a convolutional neural network (CNN) based architecture which incorporates a Conditional Random Field (CRF) graphical model, taking the whole word image as a single input. The unaries of the CRF are provided by a CNN that predicts characters at each position of the output, while higher order terms are provided by another CNN that detects the presence of N-grams. We show that this entire model (CRF, character predictor, N-gram predictor) can be jointly optimised by back-propagating the structured output loss, essentially requiring the system to perform multi-task learning, and training uses purely synthetically generated data. The resulting model is a more accurate system on standard real-world text recognition benchmarks than character prediction alone, setting a benchmark for systems that have not been trained on a particular lexicon. In addition, our model achieves state-of-the-art accuracy in lexicon-constrained scenarios, without being specifically modelled for constrained recognition. To test the generalisation of our model, we also perform experiments with random alpha-numeric strings to evaluate the method when no visual language model is applicable.

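Research Motivation and Goals

Develop a text recognition system that generalises to unseen, out-of-lexicon words without relying on a fixed lexicon.
Address the challenge of recognising arbitrary alphanumeric strings and natural-language words in unconstrained settings.
Improve recognition accuracy by jointly modelling character-level predictions and higher-order N-gram dependencies.
Train the whole system end to end using only synthetic data, avoiding any dependence on annotated real-world text data.
Achieve competitive performance in both unconstrained and lexicon-constrained settings without designing a dedicated network architecture for either one.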

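Proposed Method

A CNN predicts character probabilities at each character position; these serve as the unary potentials of the CRF.
A second CNN predicts the presence of N-grams (e.g. bigrams, trigrams) anywhere in the word image, providing the higher-order edge potentials of the CRF.
The CRF layer combines the unary and edge scores and infers the most likely character sequence through structured prediction.
The whole system is trained end to end by back-propagating a structured output loss, which jointly optimises the character predictor and the N-gram predictor.
Training relies solely on synthetically generated word images and needs no annotated real-world data.
At inference time, the model searches for the character sequence that maximises the CRF score, which keeps the predicted characters mutually consistent.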
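The scoring, inference, and training signal described above can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation. It assumes that the CRF score of a word is the sum of its per-position character scores plus the scores of the N-grams it contains, that inference is approximated with a beam search, and that the structured loss is a margin (hinge) loss between the ground-truth word and the best-scoring wrong word. The names `crf_score`, `beam_search`, and `structured_hinge_loss`, the fixed word length, the per-occurrence N-gram counting, and all toy scores are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Unary = List[Dict[str, float]]   # unary[i][c]: score of character c at position i
NGramScores = Dict[str, float]   # score of each modelled N-gram (assumed length >= 2)


def crf_score(word: str, unary: Unary, ngram_scores: NGramScores, max_n: int = 4) -> float:
    """Sum of unary scores plus the score of every N-gram occurrence in the word.

    Assumption: len(word) == len(unary); variable-length handling is omitted.
    """
    char_term = sum(unary[i][ch] for i, ch in enumerate(word))
    ngram_term = sum(ngram_scores.get(word[i:i + n], 0.0)
                     for n in range(2, max_n + 1)
                     for i in range(len(word) - n + 1))
    return char_term + ngram_term


def beam_search(unary: Unary, ngram_scores: NGramScores,
                beam_width: int = 5, max_n: int = 4) -> List[Tuple[str, float]]:
    """Approximate argmax of crf_score; returns the final beam, best first."""
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for position_scores in unary:
        candidates = []
        for prefix, score in beams:
            for ch, u in position_scores.items():
                word = prefix + ch
                # Only N-grams ending at the new character add to the score.
                gain = sum(ngram_scores.get(word[-n:], 0.0)
                           for n in range(2, max_n + 1) if len(word) >= n)
                candidates.append((word, score + u + gain))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def structured_hinge_loss(gt_word: str, unary: Unary, ngram_scores: NGramScores,
                          margin: float = 1.0, beam_width: int = 5) -> float:
    """Hinge loss asking the ground truth to beat the best wrong word by `margin`."""
    gt_score = crf_score(gt_word, unary, ngram_scores)
    wrong = [s for w, s in beam_search(unary, ngram_scores, beam_width) if w != gt_word]
    if not wrong:
        return 0.0
    return max(0.0, margin + wrong[0] - gt_score)


# Toy scores (made up): the character scores alone favour "cot",
# while the N-gram scores favour "cat".
unary = [{"c": 2.0, "e": 1.0}, {"o": 1.5, "a": 1.2}, {"t": 2.0, "l": 0.5}]
ngram_scores = {"ca": 0.4, "at": 0.5, "cat": 0.3}

print("".join(max(p, key=p.get) for p in unary))   # per-position argmax: "cot"
print(beam_search(unary, ngram_scores)[0][0])       # joint decoding: "cat"
print(structured_hinge_loss("cat", unary, ngram_scores))
```

In an actual training setup the two score tables would be outputs of the two CNNs, and the gradient of the loss with respect to those scores would be back-propagated into both networks; the sketch only shows how the quantities fit together.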

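Experimental Results

Research Questions

RQ1: Can a deep learning model achieve high accuracy on unconstrained text recognition without relying on a fixed lexicon?
RQ2: How effective is jointly modelling character-level predictions and N-gram patterns at improving recognition robustness?
RQ3: Can a model trained only on synthetic data generalise to real-world unconstrained text recognition benchmarks?
RQ4: Compared with independent character prediction, does structured CRF modelling improve performance in both unconstrained and constrained settings?
RQ5: How does the model perform on non-linguistic, random alphanumeric strings, where conventional language models break down?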

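Key Findings

With no lexicon constraint, the JOINT model reaches 89.6% accuracy on the IC03 test set, outperforming the character-only (CHAR) model (85.9%) and setting a new benchmark for lexicon-free recognition.
On the SVT dataset, the JOINT model reaches 71.7% accuracy with no lexicon constraint, 3.7 percentage points ahead of the CHAR model (68.0%), and matches or exceeds previous state-of-the-art methods in the unconstrained setting.
When constrained to a 90,000-word lexicon, the JOINT model reaches 93.1% accuracy on IC03; although it was not trained for that specific lexicon, it remains comparable to the DICT model (98.7% on IC03-Full).
On SynthRand, a dataset of random alphanumeric strings, the JOINT model maintains 81.8% accuracy, showing that it stays robust when an N-gram language model does not apply.
Qualitative examples (Figure 4) show that the CRF edge scores can correct wrong predictions of the character-only model: the JOINT model correctly recognises words that the character-only model gets wrong.
The JOINT model achieves state-of-the-art performance in both unconstrained and lexicon-constrained settings, demonstrating its flexibility and ability to generalise.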
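The summary does not spell out how a lexicon constraint is applied at test time. One natural approach, assumed here purely for illustration, is to score every lexicon word with the same joint character and N-gram score and keep the highest-scoring one, so the model itself needs no change. In the Python sketch below, `joint_score`, `recognize_with_lexicon`, the padding symbol `NULL`, and all toy scores are hypothetical.

```python
from typing import Dict, List, Sequence

NULL = "_"  # hypothetical padding symbol meaning "no character at this position"


def joint_score(word: str, unary: List[Dict[str, float]],
                ngram_scores: Dict[str, float], max_n: int = 4) -> float:
    """Unary scores of the padded word plus the scores of the N-grams it contains."""
    if len(word) > len(unary):
        return float("-inf")  # word does not fit the fixed number of positions
    padded = word + NULL * (len(unary) - len(word))
    # Characters the position never scored are treated as impossible.
    char_term = sum(unary[i].get(ch, float("-inf")) for i, ch in enumerate(padded))
    ngram_term = sum(ngram_scores.get(word[i:i + n], 0.0)
                     for n in range(2, max_n + 1)
                     for i in range(len(word) - n + 1))
    return char_term + ngram_term


def recognize_with_lexicon(lexicon: Sequence[str], unary: List[Dict[str, float]],
                           ngram_scores: Dict[str, float]) -> str:
    """Return the lexicon word with the highest joint score."""
    return max(lexicon, key=lambda w: joint_score(w, unary, ngram_scores))


# Toy scores (made up) for a word image with four output positions.
unary = [
    {"c": 2.0, "e": 1.0, NULL: -1.0},
    {"o": 1.5, "a": 1.2, NULL: -1.0},
    {"t": 2.0, "l": 0.5, NULL: -1.0},
    {"s": 0.2, NULL: 1.0},
]
ngram_scores = {"ca": 0.4, "at": 0.5}

print(recognize_with_lexicon(["cat", "cats", "eel", "dog"], unary, ngram_scores))  # cat
```

Because the lexicon only restricts which candidate words are scored, the same trained model can serve both the unconstrained and the constrained setting under this assumption.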

This review was generated by AI and reviewed by human editors.