QUICK REVIEW

[Paper Review] PixelLink: Detecting Scene Text via Instance Segmentation

Dan Deng, Haifeng Liu|arXiv (Cornell University)|Jan 4, 2018

Handwritten Text Recognition Techniques|References: 22|Citations: 53

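One-Line Summary

PixelLink detects scene text by performing instance segmentation through pixel links. It avoids regression-based bounding box localization and extracts text bounding boxes directly from the segmentation result.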

ABSTRACT

Most state-of-the-art scene text detection algorithms are deep learning based methods that depend on bounding box regression and perform at least two kinds of predictions: text/non-text classification and location regression. Regression plays a key role in the acquisition of bounding boxes in these methods, but it is not indispensable because text/non-text prediction can also be considered as a kind of semantic segmentation that contains full location information in itself. However, text instances in scene images often lie very close to each other, making them very difficult to separate via semantic segmentation. Therefore, instance segmentation is needed to address this problem. In this paper, PixelLink, a novel scene text detection algorithm based on instance segmentation, is proposed. Text instances are first segmented out by linking pixels within the same instance together. Text bounding boxes are then extracted directly from the segmentation result without location regression. Experiments show that, compared with regression-based methods, PixelLink can achieve better or comparable performance on several benchmarks, while requiring many fewer training iterations and less training data.

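Motivation and Objectives

Motivate scene text detection through instance segmentation rather than bounding box regression.
Propose a pixel-link-based network that separates text instances lying close to each other.
Extract bounding boxes directly from the segmentation result and compare against regression-based methods.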

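Proposed Method

A two-headed CNN on a shared VGG16 backbone: one head predicts text/non-text per pixel, the other predicts pixel links in 8 directions.
Each pixel is labeled text or non-text; a link between neighboring pixels indicates that they belong to the same instance.
Instance segmentation via positive links: linked positive pixels form connected components (CCs), each representing one text instance.
Bounding boxes are extracted from the CCs with minAreaRect, without regression-based location prediction.
Instance-balanced cross-entropy loss and Online Hard Example Mining (OHEM) for reliable training.
Post-processing with simple geometric filtering to remove noise.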
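
The linking step above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the neighbor ordering, and the 0.5 thresholds are placeholders chosen for illustration, and an axis-aligned box stands in for the minAreaRect step to keep the example dependency-free. It assumes that two neighboring positive pixels are joined when either of them predicts a positive link to the other.

```python
# 8-neighbor offsets as (row, col); this ordering is an assumption of the sketch.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def link_pixels(pixel_scores, link_scores, pixel_thr=0.5, link_thr=0.5):
    """Group positive pixels into instances through positive links.

    pixel_scores[r][c]    -- text probability of pixel (r, c)
    link_scores[r][c][k]  -- probability that (r, c) links to its k-th neighbor
    Thresholds are illustrative placeholders, not values from the paper.
    Returns a sorted list of instances, each a sorted list of (r, c) pixels.
    """
    h, w = len(pixel_scores), len(pixel_scores[0])
    positive = {(r, c) for r in range(h) for c in range(w)
                if pixel_scores[r][c] > pixel_thr}
    parent = {p: p for p in positive}

    for (r, c) in positive:
        for k, (dr, dc) in enumerate(NEIGHBORS):
            q = (r + dr, c + dc)
            # Join two positive pixels when this pixel predicts a positive link.
            # Visiting every pixel covers links predicted from either side.
            if q in positive and link_scores[r][c][k] > link_thr:
                root_a, root_b = find(parent, (r, c)), find(parent, q)
                if root_a != root_b:
                    parent[root_b] = root_a

    groups = {}
    for p in positive:
        groups.setdefault(find(parent, p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


def bounding_box(pixels):
    """Axis-aligned box (min_row, min_col, max_row, max_col) of one instance."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))


# Four adjacent text pixels in one row, with no positive link between
# the second and third: they separate into two instances.
scores = [[0.9, 0.9, 0.9, 0.9]]
links = [[[0.0] * 8 for _ in range(4)]]
EAST = NEIGHBORS.index((0, 1))
links[0][0][EAST] = 0.8  # pixel 0 links to pixel 1
links[0][2][EAST] = 0.8  # pixel 2 links to pixel 3

instances = link_pixels(scores, links)
boxes = [bounding_box(p) for p in instances]
```

In this toy input a plain text/non-text mask would merge all four pixels into one region; the missing link between the two middle pixels is what keeps the instances apart.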

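Experimental Results

Research Questions

RQ1 Can instance segmentation with pixel links effectively detect text instances in natural images?
RQ2 Can a pixel-link-based method achieve comparable or better accuracy than regression-based methods with less data and fewer training iterations?
RQ3 How does PixelLink perform against regression-based detectors on standard benchmarks (IC15, IC13, TD500)?
RQ4 How do network resolution, link threshold, and post-processing affect detection performance?
RQ5 Is bounding box extraction from the segmentation result sufficient for competition benchmarks?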

Key Findings

Model	Recall (R)	Precision (P)	F-score (F)	FPS
PixelLink+VGG16 2s	82.0	85.5	83.7	3.0
PixelLink+VGG16 4s	81.7	82.9	82.3	7.3
EAST+PVANET2x MS	78.3	83.3	81.0	—
EAST+PVANET2x	73.5	83.6	78.2	13.2
EAST+VGG16	72.8	80.5	76.4	6.5
SegLink+VGG16	76.8	73.1	75.0	—
CTPN+VGG16	51.6	74.2	60.9	—
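
The table does not say how the F-score column is derived. Assuming the usual definition, the harmonic mean of recall and precision, the two PixelLink rows can be reproduced with the short sketch below. The function name `f_score` is illustrative. Because the table's R and P values are already rounded, recomputing F for other rows may differ slightly from the printed value.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from the two PixelLink rows of the table.
rows = {
    "PixelLink+VGG16 2s": (82.0, 85.5),
    "PixelLink+VGG16 4s": (81.7, 82.9),
}

for name, (r, p) in rows.items():
    print(f"{name}: F = {f_score(r, p):.1f}")
# PixelLink+VGG16 2s: F = 83.7
# PixelLink+VGG16 4s: F = 82.3
```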

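On IC15, PixelLink achieves an F-score competitive with or better than regression-based methods while using fewer training iterations and less training data.
On IC15, PixelLink 4s reaches F=82.3 at 7.3 FPS, more accurate than several regression-based baselines.
PixelLink 2s is more accurate but slower than 4s (F=83.7 at 3.0 FPS).
The ablation study shows that the link mechanism is essential: removing links substantially lowers both recall and precision.
Instance-Balance and training from scratch give fast convergence and strong performance even without ImageNet pre-training.
On IC13, PixelLink with 2s and MS reaches an F-score of roughly 87.5 to 88.1 depending on scale, outperforming several baselines.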

This review was generated by AI and checked by a human editor.