QUICK REVIEW

[Paper Review] PixelLink: Detecting Scene Text via Instance Segmentation

Dan Deng, Haifeng Liu|arXiv (Cornell University)|Jan 4, 2018

Handwritten Text Recognition Techniques | References: 22 | Cited by: 53

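One-Sentence Summary

PixelLink detects scene text through instance segmentation: it links pixels that belong to the same text instance, avoids regression-based bounding box localization, and extracts text bounding boxes directly from the segmentation result.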

ABSTRACT

Most state-of-the-art scene text detection algorithms are deep learning based methods that depend on bounding box regression and perform at least two kinds of predictions: text/non-text classification and location regression. Regression plays a key role in the acquisition of bounding boxes in these methods, but it is not indispensable because text/non-text prediction can also be considered as a kind of semantic segmentation that contains full location information in itself. However, text instances in scene images often lie very close to each other, making them very difficult to separate via semantic segmentation. Therefore, instance segmentation is needed to address this problem. In this paper, PixelLink, a novel scene text detection algorithm based on instance segmentation, is proposed. Text instances are first segmented out by linking pixels within the same instance together. Text bounding boxes are then extracted directly from the segmentation result without location regression. Experiments show that, compared with regression-based methods, PixelLink can achieve better or comparable performance on several benchmarks, while requiring many fewer training iterations and less training data.

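Motivation and Objectives

Detect text by instance segmentation instead of bounding box regression.
Propose a pixel-linking network that separates text instances lying close to each other.
Extract bounding boxes directly from the segmentation result and compare against regression-based methods.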

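Proposed Method

A CNN with a shared VGG16 backbone and two prediction heads: pixel-wise text/non-text classification and link prediction in eight directions.
Each pixel is labeled text or non-text; a link between neighboring pixels indicates that they belong to the same instance.
Instance segmentation follows from the positive links: positive pixels joined by positive links form connected components (CCs), each representing one text instance.
Bounding boxes are extracted from the CCs with minAreaRect, with no regression-based location prediction.
Training uses an instance-balanced cross-entropy loss combined with online hard example mining for robustness.
Post-processing applies simple geometric filtering to remove noise.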
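
The linking, box-extraction, and filtering steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the neighbor ordering, the function names (link_pixels, boxes_from_instances), and the threshold and filter values are all hypothetical. To stay dependency-free, the sketch swaps in an axis-aligned box where the method uses minAreaRect, which would return a rotated rectangle instead.

```python
from collections import defaultdict

# Offsets of the eight neighbors as (dy, dx). The ordering is an assumption
# made for this sketch; it only has to match the layout of link_scores.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def link_pixels(pixel_scores, link_scores, pixel_thr=0.5, link_thr=0.5):
    """Group positive pixels into instances through positive links.

    pixel_scores[y][x] is the text probability of a pixel.
    link_scores[y][x][k] is the probability of a link towards NEIGHBORS[k].
    Returns a dict mapping a component root to its list of (y, x) pixels.
    """
    h, w = len(pixel_scores), len(pixel_scores[0])
    positive = [[pixel_scores[y][x] >= pixel_thr for x in range(w)] for y in range(h)]
    parent = list(range(h * w))
    for y in range(h):
        for x in range(w):
            if not positive[y][x]:
                continue
            for k, (dy, dx) in enumerate(NEIGHBORS):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not positive[ny][nx]:
                    continue
                # Every pixel is visited, so a positive link predicted from
                # either endpoint is enough to join the two pixels.
                if link_scores[y][x][k] >= link_thr:
                    ra, rb = find(parent, y * w + x), find(parent, ny * w + nx)
                    parent[ra] = rb
    instances = defaultdict(list)
    for y in range(h):
        for x in range(w):
            if positive[y][x]:
                instances[find(parent, y * w + x)].append((y, x))
    return dict(instances)


def boxes_from_instances(instances, min_side=1, min_area=1):
    """Axis-aligned boxes (x0, y0, x1, y1); components below the limits are dropped."""
    boxes = []
    for pixels in instances.values():
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
        bw, bh = x1 - x0 + 1, y1 - y0 + 1
        if min(bw, bh) >= min_side and bw * bh >= min_area:
            boxes.append((x0, y0, x1, y1))
    return sorted(boxes)


def same_word(xa, xb):
    """Toy ground truth: columns 0-2 form one word, columns 3-4 another."""
    return (xa < 3) == (xb < 3)


# Toy 2x5 map in which every pixel is text, but links exist only inside a word.
H, W = 2, 5
pixel_scores = [[0.9] * W for _ in range(H)]
link_scores = [[[0.0] * 8 for _ in range(W)] for _ in range(H)]
for y in range(H):
    for x in range(W):
        for k, (dy, dx) in enumerate(NEIGHBORS):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W and same_word(x, nx):
                link_scores[y][x][k] = 0.9

instances = link_pixels(pixel_scores, link_scores)
boxes = boxes_from_instances(instances)
print(boxes)
```

In the toy map the two words touch, so text/non-text classification alone would merge them into one blob; the missing links between columns 2 and 3 keep them apart and the sketch returns two boxes. With OpenCV available, cv2.minAreaRect applied to each component's pixel coordinates would replace the axis-aligned box.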

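Experimental Results

Research Questions

RQ1: Can text instances in natural scenes be detected effectively through pixel-link-based instance segmentation, without location regression?
RQ2: Does the pixel-link approach need less training data or fewer training iterations than regression-based methods while reaching comparable or better accuracy?
RQ3: How does PixelLink perform on standard benchmarks (IC15, IC13, TD500) compared with regression-based detectors?
RQ4: How do network resolution, link threshold, and post-processing filtering affect detection performance?
RQ5: Are bounding boxes extracted from the segmentation result adequate for competition benchmarks?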

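Key Findings

On IC15, PixelLink achieves an F-score competitive with or better than regression-based methods while using fewer training iterations and less data.
On IC15, PixelLink 4s reaches F=82.3 at 7.3 FPS, exceeding the accuracy of some regression-based baselines.
PixelLink 2s is more accurate but slower than the 4s variant (F=83.7 at 3.0 FPS).
Ablation shows the link mechanism is essential: removing links markedly reduces both recall and precision.
Instance balancing and training from scratch yield faster convergence, and strong performance is reached without ImageNet pretraining.
On IC13, PixelLink with 2s and MS reaches F-scores of about 87.5 to 88.1 depending on scale, outperforming several baselines.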
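
The instance balancing named in the method and findings can be illustrated with a short sketch. It assumes the weighting scheme in which every text instance receives the same total weight regardless of its area, so small words are not drowned out by large ones; the function name and the example areas are hypothetical and chosen only for illustration.

```python
def instance_balanced_weights(instance_areas):
    """Per-pixel loss weight for each instance, given its area in pixels.

    Every instance ends up with the same total weight B = S / N, where S is
    the total text area and N the number of instances, so a pixel of
    instance i is weighted by B / S_i.
    """
    total_area = sum(instance_areas)
    per_instance = total_area / len(instance_areas)
    return [per_instance / area for area in instance_areas]


# Hypothetical image with one small word (100 px) and one large word (400 px).
areas = [100, 400]
weights = instance_balanced_weights(areas)
totals = [w * a for w, a in zip(weights, areas)]
print(weights, totals)
```

Pixels of the small instance receive a larger weight than pixels of the large one, and the per-instance totals come out equal, which is the balancing effect the loss relies on.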

This review was generated by AI and checked by human editors.