QUICK REVIEW

[Paper Review] Single Shot Text Detector with Regional Attention

Pan He, Weilin Huang|arXiv (Cornell University)|Sep 1, 2017

Handwritten Text Recognition Techniques | References: 31 | Cited by: 58

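One-Sentence Summary

An SSD-based single-shot text detector equipped with a Text Attention Module (TAM) and a Hierarchical Inception Module (HIM) that directly outputs word-level bounding boxes and achieves state-of-the-art results on the ICDAR 2013/2015 and COCO-Text datasets.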

ABSTRACT

We present a novel single-shot text detector that directly outputs word-level bounding boxes in a natural image. We propose an attention mechanism which roughly identifies text regions via an automatically learned attentional map. This substantially suppresses background interference in the convolutional features, which is the key to producing accurate inference of words, particularly at extremely small sizes. This results in a single model that essentially works in a coarse-to-fine manner. It departs from recent FCN-based text detectors which cascade multiple FCN models to achieve an accurate prediction. Furthermore, we develop a hierarchical inception module which efficiently aggregates multi-scale inception features. This enhances local details, and also encodes strong context information, allowing the detector to work reliably on multi-scale and multi-orientation text with single-scale images. Our text detector achieves an F-measure of 77% on the ICDAR 2015 benchmark, advancing the state-of-the-art results in [18, 28]. Demo is available at: http://sstd.whuang.org/.

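Motivation and Goals

Address the challenge of word-level text detection in natural images across varying scales and orientations.
Eliminate multi-stage bottom-up processing and output word bounding boxes directly in a single pass.
Improve feature representations for multi-scale and multi-orientation text through dedicated modules.
Introduce a text-specific supervision signal so that an attention mechanism learns coarse text regions.
Improve robustness and speed to make single-shot text detection practical.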

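Proposed Method

Introduce a Text Attention Module (TAM) that learns a pixel-wise text mask and injects text-region attention into the Aggregated Inception Features (AIFs).
Develop a Hierarchical Inception Module (HIM) that aggregates multi-scale inception features and fuses information across layers to form richer AIFs.
Integrate TAM and HIM into the SSD framework so that a single forward pass directly produces word-level bounding boxes (followed by a simple NMS).
Train end-to-end, using a pixel-wise text mask loss as auxiliary supervision to guide attention learning.
At each spatial location, predict N word bounding boxes, including an orientation parameter, from a set of default boxes with multiple scales and aspect ratios.
Evaluate on ICDAR 2013, ICDAR 2015, and COCO-Text to demonstrate state-of-the-art accuracy and efficiency.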
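
The attention idea behind TAM can be illustrated with a minimal sketch. It assumes that the attention map is a per-pixel text probability obtained from a two-class softmax, that this map is multiplied element-wise into every feature channel, and that the auxiliary mask loss is a mean binary cross-entropy. These are illustrative assumptions rather than the layer layout of the paper, and all function names (`text_probability`, `attention_map`, `apply_attention`, `mask_loss`) are hypothetical. Plain nested lists stand in for tensors.

```python
import math

def text_probability(bg_logit, text_logit):
    """Two-class softmax; returns the probability of the 'text' class."""
    m = max(bg_logit, text_logit)          # subtract the max for numerical stability
    e_bg = math.exp(bg_logit - m)
    e_text = math.exp(text_logit - m)
    return e_text / (e_bg + e_text)

def attention_map(bg_logits, text_logits):
    """Per-pixel text probability for two H x W logit maps."""
    return [[text_probability(b, t) for b, t in zip(b_row, t_row)]
            for b_row, t_row in zip(bg_logits, text_logits)]

def apply_attention(features, attn):
    """Scale every channel (H x W) of `features` by the H x W attention map."""
    return [[[v * a for v, a in zip(f_row, a_row)]
             for f_row, a_row in zip(channel, attn)]
            for channel in features]

def mask_loss(attn, mask, eps=1e-7):
    """Mean binary cross-entropy between the attention map and a 0/1 text mask.

    Assumption: the summary only says a pixel-wise mask loss is used;
    binary cross-entropy is one common choice.
    """
    total, count = 0.0, 0
    for a_row, m_row in zip(attn, mask):
        for a, m in zip(a_row, m_row):
            a = min(max(a, eps), 1.0 - eps)   # avoid log(0)
            total -= m * math.log(a) + (1 - m) * math.log(1.0 - a)
            count += 1
    return total / count

# Toy 1 x 2 example: the second pixel has a strong 'text' logit.
attn = attention_map([[0.0, -5.0]], [[0.0, 5.0]])
weighted = apply_attention([[[2.0, 4.0]]], attn)
loss = mask_loss(attn, [[0, 1]])
```

In this toy example the first pixel is scaled by 0.5 and the second by a value close to 1, so features at likely background locations are suppressed while features at likely text locations pass through almost unchanged.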

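Experimental Results

Research Questions

RQ1: Can a single-shot, SSD-based detector be extended with text-specific modules to directly predict word-level bounding boxes without complex post-processing?
RQ2: Do TAM and HIM improve recall and precision for multi-scale and multi-orientation text detection in natural scenes?
RQ3: How does the method perform in terms of accuracy and speed on standard benchmarks (ICDAR 2013/2015, COCO-Text)?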

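Key Findings

Achieves state-of-the-art F-measures on the ICDAR 2013 (0.87) and ICDAR 2015 (0.77) benchmarks.
Outperforms competing methods on COCO-Text with an F-score of 0.37, showing good generalization.
The single-shot detector with TAM and HIM takes 0.13 seconds per image on a single GPU with 704x704 input.
TAM and HIM each independently improve recall and precision, and the TAM+HIM combination achieves the best overall F-measure (0.87 on ICDAR 2013).
The method maintains high word-level accuracy on small, multi-scale, and multi-orientation text without complex post-processing.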
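
For readers less familiar with the metrics above, the following sketch shows how an F-measure is derived from precision and recall, and how a per-image latency converts to throughput. The precision and recall values are made-up placeholders for illustration only, not figures reported for this detector; only the 0.13 s latency comes from the findings above. The function names are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def images_per_second(seconds_per_image):
    """Convert a per-image latency into throughput."""
    return 1.0 / seconds_per_image

# Placeholder precision/recall, chosen only to illustrate the formula.
example_f = f_measure(0.80, 0.74)

# Latency reported above: 0.13 s per 704x704 image on a single GPU.
throughput = images_per_second(0.13)
print(round(throughput, 1))  # 7.7
```

Because the F-measure is a harmonic mean, it stays close to the lower of the two inputs, so a detector cannot reach a high score by trading away recall for precision or the reverse.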


This review was generated by AI and reviewed by a human editor.