QUICK REVIEW

[Paper Review] Towards Accurate Scene Text Recognition with Semantic Reasoning Networks

Deli Yu, Xuan Li|arXiv (Cornell University)|Mar 27, 2020

Handwritten Text Recognition Techniques | References: 45 | Cited by: 56

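One-Sentence Summary

This paper proposes the Semantic Reasoning Network (SRN) and its Global Semantic Reasoning Module (GSRM), which combine parallel visual features with global semantic context for end-to-end scene text recognition, achieving state-of-the-art results on multiple benchmarks with faster, parallel inference.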

ABSTRACT

Scene text image contains two levels of contents: visual texture and semantic information. Although the previous scene text recognition methods have made great progress over the past few years, the research on mining semantic information to assist text recognition attracts less attention, only RNN-like structures are explored to implicitly model semantic information. However, we observe that RNN based methods have some obvious shortcomings, such as time-dependent decoding manner and one-way serial transmission of semantic context, which greatly limit the help of semantic information and the computation efficiency. To mitigate these limitations, we propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition, where a global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission. The state-of-the-art results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method. In addition, the speed of SRN has significant advantages over the RNN based methods, demonstrating its value in practical use.

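Research Motivation and Goals

Motivate the use of semantic information, beyond purely visual features, to assist scene text recognition.
Develop a scalable, end-to-end trainable framework that models global semantic context in parallel.
Propose a parallel visual attention module (PVAM) and a visual-semantic fusion decoder (VSFD) to integrate visual and semantic cues.
Demonstrate efficiency and robustness through extensive experiments on diverse text benchmarks.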

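Proposed Method

The backbone is ResNet50+FPN followed by transformer units, which capture global visual context.
The Parallel Visual Attention Module (PVAM) produces N aligned one-dimensional visual features in parallel.
The Global Semantic Reasoning Module (GSRM) uses a visual-to-semantic embedding block and a semantic reasoning block built from stacked transformer units to produce the semantic features S.
The Visual-Semantic Fusion Decoder (VSFD) uses a gated unit to fuse the visual features G with the semantic features S into the final prediction.
End-to-end training objective: Loss = embedding loss (L_e) + reasoning loss (L_r) + final decoding loss (L_f).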
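
The gated fusion in VSFD and the combined training objective can be illustrated with a minimal sketch. The summary does not give the exact gate formula, so the form used here (a sigmoid gate computed from the concatenated visual and semantic vectors) is an assumption, as are the names `gated_fusion`, `total_loss`, `w_z`, `b_z`, and the per-term loss weights, which default to 1.0 to match the unweighted sum stated above.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(g, s, w_z, b_z):
    """Fuse one visual vector g with one semantic vector s (both length d).

    Assumed gate form (not confirmed by the summary):
        z = sigmoid(W_z [g; s] + b_z)
        fused = z * g + (1 - z) * s
    w_z is a d x 2d weight matrix (list of rows), b_z a length-d bias.
    """
    concat = g + s  # list concatenation gives [g; s] of length 2d
    fused = []
    for i in range(len(g)):
        pre_activation = sum(w * x for w, x in zip(w_z[i], concat)) + b_z[i]
        z = sigmoid(pre_activation)
        # z near 1 trusts the visual cue, z near 0 trusts the semantic cue
        fused.append(z * g[i] + (1.0 - z) * s[i])
    return fused


def total_loss(l_e, l_r, l_f, w_e=1.0, w_r=1.0, w_f=1.0):
    """Sum of embedding, reasoning, and final decoding losses.

    The weights are hypothetical knobs; with the defaults this is the
    plain sum L_e + L_r + L_f given in the summary.
    """
    return w_e * l_e + w_r * l_r + w_f * l_f


if __name__ == "__main__":
    d = 2
    zero_weights = [[0.0] * (2 * d) for _ in range(d)]
    zero_bias = [0.0] * d
    # With a zero gate pre-activation, z = 0.5 and the output is the average.
    print(gated_fusion([1.0, 3.0], [3.0, 1.0], zero_weights, zero_bias))
    print(total_loss(1.0, 2.0, 3.0))
```

With all gate weights at zero the gate sits at 0.5, so the fused vector is the element-wise average of the two inputs; training would move the gate toward whichever cue is more reliable for each character.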

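Experimental Results

Research Questions

RQ1: Does global, multi-way semantic reasoning improve scene text recognition compared with one-way or sequential semantic modeling?
RQ2: How can visual and semantic information be fused effectively in a parallel, end-to-end framework?
RQ3: Can parallel attention over visual features, combined with global semantic reasoning, speed up inference while maintaining accuracy?
RQ4: Without a lexicon, how does SRN perform on regular, irregular, and non-Latin long-text benchmarks?
RQ5: How do the GSRM configuration (number of transformer units) and the fusion strategy affect performance?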

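Key Findings

SRN with GSRM achieves state-of-the-art performance on multiple public benchmarks, covering regular, irregular, and non-Latin long-text datasets.
PVAM aligns visual features to each target character in parallel, which is more efficient than time-dependent attention.
GSRM yields significant gains by modeling global semantic context; multi-way (parallel) reasoning outperforms the one-way semantic reasoning variants.
VSFD balances visual and semantic cues effectively through gated fusion, enabling robust lexicon-free recognition.
Compared with RNN-based semantic models, inference is faster thanks to parallel processing, while accuracy on long text remains high.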
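
To make the contrast with time-dependent attention concrete, here is a minimal sketch of attention in which each character position uses its own query and never reads the output of a previous step. Dot-product scoring is an assumption chosen for brevity, since the summary does not specify the scoring function; `parallel_attention`, `queries`, and `features` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_attention(queries, features):
    """Return one aligned visual vector per character position.

    queries:  N position embeddings, each of length d (one per reading-order slot)
    features: M flattened visual feature vectors, each of length d

    Every output row depends only on its own query and the shared
    features, so the N rows can be computed in any order or all at once.
    An RNN decoder would instead need the previous step's hidden state.
    """
    d = len(features[0])
    aligned = []
    for q in queries:
        # Assumed dot-product score between the position query and each feature
        scores = [sum(qi * vi for qi, vi in zip(q, v)) for v in features]
        weights = softmax(scores)
        aligned.append(
            [sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)]
        )
    return aligned


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 4.0]]
    queries = [[0.0, 0.0], [1.0, 0.0], [0.0, -1.0]]
    for row in parallel_attention(queries, features):
        print(row)
```

Reversing the order of `queries` simply reverses the order of the output rows, which is the property that allows all N positions to be decoded in a single batched pass.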

This review was generated by AI and reviewed by a human editor.