QUICK REVIEW

[Paper Review] Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Wenhai Wang, Enze Xie|arXiv (Cornell University)|Aug 16, 2019

Handwritten Text Recognition Techniques|References: 56|Cited by: 75

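One-Sentence Summary

PAN detects arbitrary-shaped text with a lightweight segmentation head and a learnable pixel-aggregation post-processing step, achieving strong accuracy on curved-text benchmarks at real-time to near-real-time speeds.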

ABSTRACT

Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

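Motivation and Objectives

Address the speed-accuracy trade-off in arbitrary-shaped scene text detection.
Develop a lightweight segmentation head that enhances multi-scale features.
Introduce pixel aggregation, which groups text pixels to their kernels using learned similarity.
Achieve efficient, end-to-end post-processing that reconstructs complete text instances.
Demonstrate performance comparable to state-of-the-art methods on curved-text benchmarks at real-time speed.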

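Proposed Method

Use ResNet-18 as a lightweight backbone for segmentation.
Introduce cascaded Feature Pyramid Enhancement Modules (FPEM) to enlarge the receptive field at low cost.
Use the Feature Fusion Module (FFM) to fuse features from FPEMs of different depths into the final segmentation feature.
Predict a text region, a kernel, and a similarity vector for each pixel.
Apply Pixel Aggregation (PA), which uses the learned similarity vectors to guide text pixels to their corresponding kernels.
Train with a combination of text/kernel losses and pixel aggregation losses (L_agg, L_dis), using dice loss for the segmentation terms.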
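The training objective in the last item can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the use of each kernel's mean similarity vector as the instance reference, and the margin defaults (0.5 and 3.0) are assumptions that this review does not state, and the weights used to combine the terms into one loss are left out.

```python
import math


def dice_loss(pred, target):
    """Dice loss between predicted probabilities and a binary mask (flat lists)."""
    inter = sum(p * g for p, g in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(g * g for g in target)
    return 1.0 - 2.0 * inter / (denom + 1e-6)  # small epsilon avoids division by zero


def _dist(a, b):
    """Euclidean distance between two similarity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean_vec(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def aggregation_loss(text_vecs, kernel_vecs, margin_agg=0.5):
    """L_agg sketch: pull each text pixel toward its own kernel's mean vector.

    text_vecs[i] / kernel_vecs[i] hold the similarity vectors of the text
    pixels / kernel pixels of instance i. margin_agg is an assumed default.
    """
    total = 0.0
    for t_vecs, k_vecs in zip(text_vecs, kernel_vecs):
        center = _mean_vec(k_vecs)
        per_pixel = [
            math.log(max(_dist(v, center) - margin_agg, 0.0) ** 2 + 1.0)
            for v in t_vecs
        ]
        total += sum(per_pixel) / len(per_pixel)
    return total / len(text_vecs)


def discrimination_loss(kernel_vecs, margin_dis=3.0):
    """L_dis sketch: push mean vectors of different kernels apart.

    margin_dis is an assumed default.
    """
    centers = [_mean_vec(k) for k in kernel_vecs]
    n = len(centers)
    if n < 2:
        return 0.0  # a single instance has nothing to be separated from
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                gap = max(margin_dis - _dist(centers[i], centers[j]), 0.0)
                total += math.log(gap ** 2 + 1.0)
    return total / (n * (n - 1))
```

With these margins, text pixels already within 0.5 of their kernel's mean vector contribute nothing to the aggregation term, and kernels whose mean vectors are more than 3.0 apart contribute nothing to the discrimination term.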

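Experimental Results

Research Questions

RQ1: Can a lightweight segmentation head (FPEM + FFM) close the performance gap in arbitrary-shaped text detection while maintaining high speed?
RQ2: Can pixel aggregation accurately reconstruct complete text instances from kernels in real time?
RQ3: How do PA and the FPEM cascade depth affect accuracy and throughput on curved-text and multi-oriented benchmarks?
RQ4: How does PAN compare with state-of-the-art methods on CTW1500, Total-Text, and other benchmarks in terms of F-measure and FPS?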

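Key Findings

PAN achieves F-measures comparable to state-of-the-art methods on the curved-text benchmarks (CTW1500 and Total-Text) while running at high FPS (e.g., PAN-320 reaches about 84.2 FPS on CTW1500 without external pre-training; PAN-640 reaches about 39.8 FPS on CTW1500).
Cascading FPEMs improves feature representation at almost no extra cost; two cascaded FPEMs give a more favorable speed/accuracy balance.
FFM fuses multi-depth features effectively with low overhead, outperforming simple concatenation in accuracy at similar speed.
Pixel Aggregation (PA) improves accuracy by using learned similarity vectors to align text pixels with kernels; ablations show a clear drop when PA is removed.
Pre-training on SynthText further improves performance (e.g., PAN-320 reaches an F-measure of about 79.9% on CTW1500; PAN-640 reaches up to 85.0% on Total-Text).
PAN performs well on curved text while maintaining real-time or near-real-time speed, and leads several baseline methods in accuracy and speed on CTW1500, Total-Text, ICDAR 2015, and MSRA-TD500.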
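The pixel aggregation post-processing credited in the findings above can be sketched as a breadth-first expansion from kernels into the surrounding text region. This is an illustrative pure-Python version, not the authors' code: the function name `pixel_aggregation`, the 4-neighbor connectivity, and the `dist_threshold` default are assumptions, and the inputs are assumed to be already-thresholded network outputs with kernels labeled by connected component.

```python
from collections import deque
import math


def pixel_aggregation(text_mask, kernel_labels, sim_vectors, dist_threshold=6.0):
    """Grow each kernel into a full text instance (hypothetical sketch).

    text_mask:     2D list of 0/1, predicted text region
    kernel_labels: 2D list of ints, 0 = background, k > 0 = kernel id
    sim_vectors:   2D list of per-pixel similarity vectors
    Returns a new 2D label map; kernel_labels is not modified.
    """
    h, w = len(text_mask), len(text_mask[0])
    labels = [row[:] for row in kernel_labels]

    # Mean similarity vector of each kernel, and the BFS seeds.
    sums, counts = {}, {}
    queue = deque()
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            if k > 0:
                vec = sim_vectors[y][x]
                acc = sums.setdefault(k, [0.0] * len(vec))
                for d, v in enumerate(vec):
                    acc[d] += v
                counts[k] = counts.get(k, 0) + 1
                queue.append((y, x))
    centers = {k: [s / counts[k] for s in sums[k]] for k in sums}

    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    # Expand outward: an unlabeled text pixel joins a neighboring instance
    # only if its similarity vector is close to that kernel's mean vector.
    while queue:
        y, x = queue.popleft()
        k = labels[y][x]
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue
            if labels[ny][nx] != 0 or not text_mask[ny][nx]:
                continue
            if dist(sim_vectors[ny][nx], centers[k]) >= dist_threshold:
                continue
            labels[ny][nx] = k
            queue.append((ny, nx))
    return labels
```

In a one-row example with two kernels, a text pixel lying between them is claimed by the kernel whose mean similarity vector it matches, even when the other kernel's expansion reaches it first; this is the behavior that lets adjacent text instances be separated.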


This review was generated by AI and reviewed by human editors.