QUICK REVIEW

[Paper Digest] Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks

Ian Goodfellow, Yaroslav Bulatov|arXiv (Cornell University)|Dec 20, 2013

Handwritten Text Recognition Techniques | Cited by 435

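One-sentence summary

This paper proposes a deep convolutional neural network that localizes, segments, and recognizes multi-digit numbers in Street View imagery end to end. It reaches 97.84% accuracy on per-digit recognition, over 96% on complete street numbers in SVHN, and 99.8% on the hardest category of reCAPTCHA, which the authors describe as comparable to, and in some cases exceeding, human operators at specific operating thresholds.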

ABSTRACT

Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to solve this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly on the image pixels. We employ the DistBelief implementation of deep neural networks in order to train large, distributed neural networks on high quality images. We find that the performance of this approach increases with the depth of the convolutional network, with the best performance occurring in the deepest architecture we trained, with eleven hidden layers. We evaluate this approach on the publicly available SVHN dataset and achieve over $96\%$ accuracy in recognizing complete street numbers. We show that on a per-digit recognition task, we improve upon the state-of-the-art, achieving $97.84\%$ accuracy. We also evaluate this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations and achieve over $90\%$ accuracy. To further explore the applicability of the proposed system to broader text recognition tasks, we apply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of the most secure reverse turing tests that uses distorted text to distinguish humans from bots. We report a $99.8\%$ accuracy on the hardest category of reCAPTCHA. Our evaluations on both tasks indicate that at specific operating thresholds, the performance of the proposed system is comparable to, and in some cases exceeds, that of human operators.

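Motivation and objectives

Develop a unified end-to-end system that localizes, segments, and recognizes multi-digit numbers in unconstrained Street View images.
Improve on traditional pipeline-based approaches by removing the separate localization and segmentation stages.
Evaluate the model on real-world data: the public SVHN dataset and a larger Street View dataset with several tens of millions of annotations.
Assess how well the model generalizes to synthetic distorted text by applying it to reCAPTCHA puzzles.
Determine whether a deep neural network architecture can reach human-level performance on complex real-world OCR tasks.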

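Proposed method

A deep convolutional neural network with eleven hidden layers is trained end to end on raw pixels, mapping an image directly to a digit sequence.
The network uses a novel output layer that models the sequence as conditionally independent digits, within a probabilistic framework for sequence prediction.
Training uses the DistBelief framework to scale large distributed neural networks across many machines.
The model relies on hierarchical feature learning, with shallower layers handling localization and segmentation and deeper layers focusing on recognition.
The architecture handles variable-length sequences (up to a maximum length N), and each digit position is classified with its own weight matrix.
A sliding-window decoding strategy is discussed as a potential way to improve statistical efficiency on longer sequences.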
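The output layer described above can be illustrated with a small decoding sketch. It assumes the network emits one log-probability vector over the sequence length (0..N) and one over the ten digit classes for each of the N positions, and that the sequence probability factorizes as the length term times the per-position digit terms. The function and helper names (`decode_sequence`, `peaked`) are hypothetical, and this is not the authors' DistBelief code.

```python
import math


def decode_sequence(length_log_probs, digit_log_probs):
    """Pick the most likely digit string under a factorized output model.

    length_log_probs: N+1 log-probabilities for sequence lengths 0..N.
    digit_log_probs:  N lists of 10 log-probabilities, one list per position.
    Returns (digits, log_prob) for the best-scoring length and digits.
    """
    # Positions are independent given the image, so the best digit at each
    # position can be chosen without looking at the other positions.
    best = [max(range(10), key=lambda d: row[d]) for row in digit_log_probs]
    best_lp = [row[d] for row, d in zip(digit_log_probs, best)]

    best_score, best_len = -math.inf, 0
    running = 0.0  # sum of the best digit log-probs over the first n positions
    for n, len_lp in enumerate(length_log_probs):
        if n > 0:
            running += best_lp[n - 1]
        score = len_lp + running
        if score > best_score:
            best_score, best_len = score, n
    return best[:best_len], best_score


def peaked(digit, p=0.91):
    """Toy per-position distribution: probability p on one digit (assumption)."""
    rest = (1.0 - p) / 9.0
    return [math.log(p if k == digit else rest) for k in range(10)]


# Toy example with N = 3: the length head favors two digits.
length_lp = [math.log(p) for p in (0.05, 0.10, 0.80, 0.05)]
digits, log_prob = decode_sequence(length_lp, [peaked(4), peaked(2), peaked(7)])
print(digits)  # prints [4, 2]
```

Because the length term and each digit term are scored separately, the search is linear in N rather than exponential, which is one practical consequence of the conditional-independence assumption.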

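Experimental results

Research questions

RQ1: Can a deep convolutional neural network jointly localize, segment, and recognize multi-digit numbers in unconstrained Street View images?
RQ2: Does increasing network depth significantly improve multi-digit recognition compared with shallower architectures?
RQ3: Can a unified deep learning model reach human-level performance on challenging OCR tasks such as distorted reCAPTCHA puzzles?
RQ4: To what extent does performance depend on the network's depth and representational capacity rather than on parameter count alone?
RQ5: How does the model scale to a large real-world dataset containing several tens of millions of annotated street numbers?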

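Key findings

On per-digit recognition, the model reaches 97.84% accuracy, improving on the state of the art at the time.
On complete street number recognition using the SVHN dataset, the model exceeds 96% accuracy.
On a larger dataset built from Street View imagery (several tens of millions of annotations), the model exceeds 90% accuracy.
On the hardest category of reCAPTCHA puzzles, the model reaches 99.8% transcription accuracy, comparable to and in some cases exceeding human performance at specific operating thresholds.
Performance improves with network depth: deeper architectures significantly outperform wider but shallower models, which tend to overfit.
The model has transcribed close to 100 million street numbers from Street View imagery at operator-level accuracy, significantly improving geocoding quality in multiple countries.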
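The phrase "operating threshold" can be made concrete with a generic sketch: the system transcribes automatically only when its confidence is high enough, and one measures what fraction of inputs (coverage) it can handle at a target accuracy. The code below illustrates that general idea on made-up toy values; it is not the authors' evaluation code, the name `coverage_at_accuracy` is hypothetical, and it assumes confidences are distinct.

```python
def coverage_at_accuracy(confidences, correct, target_accuracy):
    """Largest fraction of examples that can be accepted, most confident
    first, while accuracy on the accepted examples stays >= target_accuracy.

    Returns (coverage, threshold), or (0.0, None) if no prefix meets the target.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    best_coverage, best_threshold = 0.0, None
    hits = 0
    for k, (conf, ok) in enumerate(ranked, start=1):
        hits += 1 if ok else 0
        # Accuracy over the k most confident examples.
        if hits / k >= target_accuracy:
            best_coverage, best_threshold = k / len(ranked), conf
    return best_coverage, best_threshold


# Toy values (assumptions, not results from the paper).
confidences = [0.99, 0.95, 0.90, 0.80, 0.60]
correct = [True, True, True, False, True]
coverage, threshold = coverage_at_accuracy(confidences, correct, target_accuracy=0.9)
print(coverage, threshold)  # prints 0.6 0.9
```

Inputs whose confidence falls below the chosen threshold would be left for human operators, which is why accuracy figures in this setting are tied to a specific threshold.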

This digest was generated by AI and reviewed by human editors.