QUICK REVIEW

[Paper Review] Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Wenhai Wang, Enze Xie|arXiv (Cornell University)|Aug 16, 2019

Handwritten Text Recognition Techniques|References: 56|Citations: 75

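One-line summary

PAN detects arbitrary-shaped text with a lightweight segmentation head and a learnable Pixel Aggregation post-processing step, achieving high accuracy on curved-text benchmarks at real-time to near-real-time speed.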

ABSTRACT

Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

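Motivation and objectives

Address the speed-accuracy trade-off in arbitrary-shaped scene text detection.
Develop a lightweight segmentation head that enhances multi-scale features.
Introduce Pixel Aggregation, which merges text pixels into kernels using learned similarity.
Realize efficient end-to-end post-processing that reconstructs complete text instances.
Demonstrate state-of-the-art performance on curved-text benchmarks at real-time speed.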

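Proposed method

Use a lightweight ResNet-18 as the backbone of the segmentation network.
Introduce a cascaded Feature Pyramid Enhancement Module (FPEM) that enlarges the receptive field at low cost.
Use a Feature Fusion Module (FFM) to fuse the features from FPEMs of different depths into the final segmentation feature.
Predict the text region, kernel, and similarity vector for each pixel.
Apply Pixel Aggregation (PA), which uses the learned similarity vectors to guide text pixels to their corresponding kernels.
Train with a combination of text-region and kernel losses and the Pixel Aggregation losses (L_agg, L_dis); the segmentation terms use dice loss.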
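
The training objective in the last item can be made concrete with a small NumPy sketch. It assumes the formulation as it is commonly summarized from the paper: dice loss for the text and kernel maps, an aggregation term that pulls each text pixel's similarity vector toward the mean vector of its kernel, and a discrimination term that pushes the mean vectors of different kernels apart. The function names, the array layout (`sim` as H x W x C, integer label maps with 0 as background), and the margin defaults `margin_agg` and `margin_dis` are illustrative assumptions, not values confirmed by this review.

```python
import numpy as np


def dice_loss(pred, target, eps=1e-6):
    """Dice loss between a predicted probability map and a binary mask."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2)
    return 1.0 - 2.0 * intersection / (denom + eps)


def kernel_means(sim, kernel_labels, n_instances):
    """Mean similarity vector of each kernel (kernels assumed non-empty)."""
    return np.stack([sim[kernel_labels == i].mean(axis=0)
                     for i in range(1, n_instances + 1)])


def aggregation_loss(sim, text_labels, kernel_labels, n_instances,
                     margin_agg=0.5):  # margin is an assumed default
    """Pull text-pixel vectors toward the mean vector of their own kernel."""
    means = kernel_means(sim, kernel_labels, n_instances)
    total = 0.0
    for i in range(1, n_instances + 1):
        vecs = sim[text_labels == i]                      # (n_pixels, C)
        dist = np.linalg.norm(vecs - means[i - 1], axis=1)
        penalty = np.maximum(dist - margin_agg, 0.0) ** 2
        total += np.mean(np.log(penalty + 1.0))
    return total / n_instances


def discrimination_loss(sim, kernel_labels, n_instances,
                        margin_dis=3.0):  # margin is an assumed default
    """Push the mean vectors of different kernels apart."""
    if n_instances < 2:
        return 0.0
    means = kernel_means(sim, kernel_labels, n_instances)
    total = 0.0
    for i in range(n_instances):
        for j in range(n_instances):
            if i != j:
                dist = np.linalg.norm(means[i] - means[j])
                total += np.log(np.maximum(margin_dis - dist, 0.0) ** 2 + 1.0)
    return total / (n_instances * (n_instances - 1))
```

The review does not state how the four terms are weighted, so the sketch leaves the final weighted sum to the caller.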

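Experimental results

Research questions

RQ1: Can a lightweight segmentation head (FPEM + FFM) close the performance gap in arbitrary-shaped text detection while maintaining high speed?
RQ2: Does PA enable accurate reconstruction of complete text instances from kernels in real time?
RQ3: How much do PA and the FPEM cascade depth affect accuracy and throughput on curved-text and multi-oriented benchmarks?
RQ4: How does PAN compare with state-of-the-art methods in F-measure and FPS on CTW1500, Total-Text, and other benchmarks?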

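Key findings

PAN achieves F-measures competitive with the state of the art on the curved-text benchmarks (CTW1500 and Total-Text) while running at high FPS (e.g., without external pre-training, PAN-320 reaches about 84.2 FPS and PAN-640 about 39.8 FPS on CTW1500).
Cascading FPEMs improves the feature representation at minimal extra cost; a cascade of two FPEMs gives a favorable balance between speed and accuracy.
FFM fuses the multi-depth features effectively with low overhead, outperforming simple concatenation in accuracy at comparable speed.
Pixel Aggregation (PA) improves accuracy by pulling text pixels toward their kernels via learned similarity vectors; the ablation that removes PA shows a meaningful drop in accuracy.
Pre-training on SynthText improves performance further (e.g., PAN-320 reaches about 79.9% F-measure on CTW1500, and PAN-640 up to 85.0% on Total-Text).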
PAN performs strongly on curved text at real-time or near-real-time speed, and is reported to offer a favorable accuracy-speed trade-off against the baselines on CTW1500, Total-Text, ICDAR 2015, and MSRA-TD500.
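
The Pixel Aggregation finding above refers to a post-processing step that grows each kernel outward over the text region. A minimal sketch under stated assumptions: kernels are labelled as 4-connected components, each kernel is represented by the mean of its pixels' similarity vectors, and a neighbouring text pixel joins a kernel when its vector lies within `max_dist` of that mean. The names `label_kernels`, `pixel_aggregation`, and `max_dist`, the default threshold, and the pure-Python breadth-first search are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import deque

import numpy as np

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connectivity (assumed)


def label_kernels(kernel_mask):
    """Label 4-connected kernel components; returns (labels, count)."""
    h, w = kernel_mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    for y in range(h):
        for x in range(w):
            if kernel_mask[y, x] and labels[y, x] == 0:
                count += 1
                labels[y, x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in NEIGHBORS:
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and kernel_mask[ny, nx]
                                and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            queue.append((ny, nx))
    return labels, count


def pixel_aggregation(text_mask, kernel_mask, sim, max_dist=6.0):
    """Grow kernels over text pixels whose vectors are close to the kernel mean.

    text_mask, kernel_mask: boolean arrays of shape (H, W)
    sim: similarity vectors of shape (H, W, C)
    max_dist: distance threshold (an assumed default)
    """
    labels, count = label_kernels(kernel_mask)
    if count == 0:
        return labels
    # Kernel means are computed once, from the original kernel pixels only.
    means = np.stack([sim[labels == i].mean(axis=0)
                      for i in range(1, count + 1)])
    h, w = text_mask.shape
    queue = deque(zip(*np.nonzero(labels)))
    while queue:
        y, x = queue.popleft()
        k = labels[y, x]
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue
            if labels[ny, nx] != 0 or not text_mask[ny, nx]:
                continue
            if np.linalg.norm(sim[ny, nx] - means[k - 1]) < max_dist:
                labels[ny, nx] = k
                queue.append((ny, nx))
    return labels
```

Because a pixel is merged only when its vector is close to the kernel mean, two touching text instances can be separated even where their text masks overlap, which is the role the review attributes to the learned similarity vectors.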


This review was generated by AI and checked by a human editor.