QUICK REVIEW

[Paper Digest] SVTR: Scene Text Recognition with a Single Visual Model

Yongkun Du, Zhineng Chen | arXiv (Cornell University) | Apr 30, 2022

Handwritten Text Recognition Techniques | Cited by 25

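One-sentence summary

SVTR proposes a single visual model that recognizes scene text by splitting the image into character components and applying local and global mixing blocks, removing the need for a separate sequence model; it reaches accuracy competitive with the state of the art while running faster at inference, and includes a tiny variant for resource-constrained settings.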

ABSTRACT

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR.

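Research motivation and goals

To achieve accurate scene text recognition with a single visual model, rather than a hybrid CNN/RNN or encoder-decoder framework.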

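Proposed method

Progressive overlapping patch embedding that tokenizes the image step by step into patches called character components.
A three-stage backbone with progressively decreasing height, built from local (stroke-like) and global (inter-character) mixing blocks.
Merging and combining operations that build multi-scale representations, with a linear prediction producing the final character sequence.
A single visual model that replaces complex language-aware pipelines and supports recognition across languages.
Model variants SVTR-T, SVTR-S, SVTR-B, and SVTR-L, with increasing capacity and different speed characteristics.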
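
The height-progressive layout described above can be traced with a small shape calculation. This is only a sketch under assumptions that this digest does not state: the patch embedding is taken to downsample both axes by 4, each merging step is taken to halve the height while keeping the width, and the combining step is taken to pool the height down to 1. The function name `svtr_token_grid` is hypothetical.

```python
def svtr_token_grid(height, width, num_stages=3):
    """Trace the (rows, cols) token grid through a height-reducing backbone.

    Assumptions (not stated in the digest): patch embedding downsamples both
    axes by 4, each merging step halves the height only, and the final
    combining step pools the height to 1.
    """
    rows, cols = height // 4, width // 4
    shapes = [("patch embedding", rows, cols)]
    # One merging step sits between consecutive stages.
    for step in range(1, num_stages):
        rows = rows // 2
        shapes.append((f"merging {step}", rows, cols))
    # Combining leaves a single row of tokens, one per horizontal position.
    shapes.append(("combining", 1, cols))
    return shapes


if __name__ == "__main__":
    for name, rows, cols in svtr_token_grid(32, 100):
        print(f"{name}: {rows} x {cols} tokens")
```

Under these assumptions a 32 x 100 input goes from an 8 x 25 grid to 4 x 25, then 2 x 25, and ends as a single row of 25 tokens, each of which a linear layer could map to a character class.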

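Experimental results

Research questions

RQ1: Can a single visual model match or exceed the accuracy of language-augmented or cross-modal models in scene text recognition?
RQ2: Can local and global component-level mixing blocks deliver effective multi-grained character feature perception?
RQ3: Are patch-based multi-stage processing and merging/combining robust enough for both English and Chinese scene text recognition?
RQ4: What trade-offs exist among model size, accuracy, and inference speed across the SVTR variants?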

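Key findings

With a single visual model, SVTR achieves competitive accuracy on English benchmarks and better results on Chinese text recognition.
SVTR-L delivers strong accuracy while running faster than many comparable methods.
SVTR-T is an effective and much smaller model with fast inference (about 4.5 ms per image on an NVIDIA 1080Ti).
The proposed local and global mixing blocks, together with the multi-scale backbone, enable multi-grained character feature perception (stroke-like local patterns and inter-character dependencies).
Progressive overlapping patch embedding and stage-wise height reduction (merging) help both efficiency and accuracy; ablation studies show the benefit of the choice of patch embedding and of the permutation of mixing blocks.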
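
The difference between local and global mixing can be illustrated with an attention mask over the token grid. The sketch below is one plausible way to express it, not necessarily how SVTR implements it: global mixing lets every token see every other token, while local mixing restricts each token to a small neighbourhood around its grid position. The function name `mixing_mask` and the `window` parameter are hypothetical, and the window sizes used here are illustrative only.

```python
def mixing_mask(rows, cols, window=None):
    """Return mask[i][j]: may token i attend to token j?

    window=None      -> global mixing: every token sees every token.
    window=(wh, ww)  -> local mixing: token i sees only tokens inside a
                        wh x ww neighbourhood centred on it (odd sizes assumed).
    Tokens are indexed row-major over a rows x cols grid.
    """
    n = rows * cols
    if window is None:
        return [[True] * n for _ in range(n)]
    half_h, half_w = window[0] // 2, window[1] // 2
    mask = []
    for i in range(n):
        ri, ci = divmod(i, cols)
        row = []
        for j in range(n):
            rj, cj = divmod(j, cols)
            # Keep j only if it lies within the window around token i.
            row.append(abs(ri - rj) <= half_h and abs(ci - cj) <= half_w)
        mask.append(row)
    return mask


if __name__ == "__main__":
    local = mixing_mask(2, 4, window=(3, 3))
    print(sum(local[0]), "of", len(local[0]), "tokens visible to the corner token")
```

In a real attention layer, masked-out positions would typically receive a large negative score before the softmax, so local blocks capture stroke-like detail while global blocks capture dependencies between characters.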


This digest was generated by AI and reviewed by human editors.