QUICK REVIEW

[Paper Review] Scene Text Detection via Holistic, Multi-Channel Prediction

Cong Yao, Xiang Bai|arXiv (Cornell University)|Jun 29, 2016

Handwritten Text Recognition Techniques | References: 52 | Citations: 212

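One-Line Summary

This paper treats scene text detection as semantic segmentation: a single FCN jointly predicts text regions, characters, and linking orientation, which enables detection of multi-oriented and curved text. It achieves state-of-the-art results on ICDAR 2013, ICDAR 2015, and MSRA-TD500, and reports the first baseline result on COCO-Text.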

ABSTRACT

Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge. However, vast majority of the existing methods detect text within local regions, typically through extracting character, word or line level candidates followed by candidate aggregation and false positive elimination, which potentially exclude the effect of wide-scope and long-range contextual cues in the scene. To take full advantage of the rich information available in the whole natural image, we propose to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. The proposed algorithm directly runs on full images and produces global, pixel-wise prediction maps, in which detections are subsequently formed. To better make use of the properties of text, three types of information regarding text region, individual characters and their relationship are estimated, with a single Fully Convolutional Network (FCN) model. With such predictions of text properties, the proposed algorithm can simultaneously handle horizontal, multi-oriented and curved text in real-world natural images. The experiments on standard benchmarks, including ICDAR 2013, ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm substantially outperforms previous state-of-the-art approaches. Moreover, we report the first baseline result on the recently-released, large-scale dataset COCO-Text.

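Motivation and Objectives

Frame scene text detection as a semantic segmentation problem so that global image context can be exploited.
Jointly predict three text-related properties: text regions, individual characters, and the linking orientation between characters.
Develop a pipeline that forms detections from pixel-wise segmentation maps, using segmentation, graph-based grouping, and partitioning.
Demonstrate robustness to multi-oriented and curved text through evaluation on standard benchmarks (ICDAR 2013/2015, MSRA-TD500) and on COCO-Text.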

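Proposed Method

Extends an HED-inspired Fully Convolutional Network (FCN) to output three prediction maps per image: text regions, characters (shrunk during training), and linking orientation.
The ground truth consists of a binary text region map, a binary character map, and a soft orientation map; orientations are defined in [−π/2, π/2] and normalized to [0, 1].
Trains with a multi-channel loss that combines the text region, character, and orientation losses, using equal weights (λ1 = λ2 = λ3 = 1/3) in the weighted objective.
At inference, generates the prediction maps, obtains text regions and character candidates by adaptive thresholding, and links characters into text lines using Delaunay triangulation and a graph.
Graph-based grouping uses a maximum spanning tree and scoring based on straightness, distance, and orientation to partition characters into text lines; a threshold τ handles non-linear layouts, including curved text.
At test time, predictions are fused across scales to produce the final detections.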
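
The orientation encoding and the equally weighted objective described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the linear mapping from [−π/2, π/2] to [0, 1] is one assumed form of the normalization, and the three scalar loss arguments stand in for whatever per-map losses the network actually uses.

```python
import math

def normalize_orientation(theta: float) -> float:
    """Map an angle in [-pi/2, pi/2] linearly onto [0, 1] (assumed mapping)."""
    if not -math.pi / 2 <= theta <= math.pi / 2:
        raise ValueError("theta must lie in [-pi/2, pi/2]")
    return (theta + math.pi / 2) / math.pi

def denormalize_orientation(value: float) -> float:
    """Inverse of normalize_orientation: map [0, 1] back to [-pi/2, pi/2]."""
    return value * math.pi - math.pi / 2

def combined_loss(region_loss: float, char_loss: float, orient_loss: float,
                  weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three per-map losses; equal weights by default."""
    l1, l2, l3 = weights
    return l1 * region_loss + l2 * char_loss + l3 * orient_loss

# Placeholder loss values, chosen only to show the weighting.
print(normalize_orientation(0.0))
print(combined_loss(0.3, 0.6, 0.9))
```

With equal weights the objective is simply the mean of the three losses, so no channel dominates training unless its raw loss is on a different scale.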
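
The grouping step can be illustrated with a small, self-contained sketch of a maximum spanning tree over character candidates. Several things here are assumptions rather than details from the paper: the candidate edges are given by hand (in the method they would come from the triangulation of character centers), each edge carries a single precomputed similarity weight, and the final split uses a hypothetical `cut_threshold` on edge weight as a simplified stand-in for the straightness-based partitioning.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (node_a, node_b, similarity weight)

class DisjointSet:
    """Union-find structure used by Kruskal's algorithm."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False  # already connected; adding the edge would form a cycle
        self.parent[root_a] = root_b
        return True

def maximum_spanning_tree(n: int, edges: List[Edge]) -> List[Edge]:
    """Kruskal's algorithm, visiting edges from highest to lowest weight."""
    sets = DisjointSet(n)
    tree: List[Edge] = []
    for a, b, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if sets.union(a, b):
            tree.append((a, b, w))
    return tree

def split_into_groups(n: int, tree: List[Edge], cut_threshold: float) -> List[List[int]]:
    """Drop tree edges below cut_threshold and return the connected groups."""
    sets = DisjointSet(n)
    for a, b, w in tree:
        if w >= cut_threshold:
            sets.union(a, b)
    groups: Dict[int, List[int]] = {}
    for node in range(n):
        groups.setdefault(sets.find(node), []).append(node)
    return sorted(groups.values())

# Five hypothetical character candidates with made-up similarity weights.
edges = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.3), (2, 3, 0.2), (3, 4, 0.85), (1, 3, 0.1)]
tree = maximum_spanning_tree(5, edges)
print(split_into_groups(5, tree, cut_threshold=0.5))
```

Building the tree first keeps only the strongest link between any two candidates, so the later cut removes a weak connection between two lines without being undone by a redundant path.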

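Experimental Results

Research Questions

RQ1: Can scene text detection be improved by moving from local, region-based decisions to holistic, pixel-level prediction over the whole image?
RQ2: Does predicting characters and linking orientation as additional outputs of a single FCN improve the separation and grouping of adjacent text lines?
RQ3: Can a graph-based, multi-channel prediction framework robustly detect multi-oriented and curved text in natural scenes?
RQ4: How does holistic multi-channel text detection compare with existing methods on standard benchmarks (ICDAR 2013/2015, MSRA-TD500) and on COCO-Text?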

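Key Findings

On ICDAR 2013, the proposed method attains high recall (0.8022), with precision 0.8888 and F-measure 0.8433.
On ICDAR 2015, it reaches precision 0.7226, recall 0.5869, and F-measure 0.6477, exceeding most baselines in recall and coming close to the best method in precision.
On MSRA-TD500, it achieves precision 0.7651, recall 0.7531, and F-measure 0.7591, a notable improvement in recall over prior work.
On the COCO-Text validation set, it reports precision 0.4323, recall 0.271, and F-measure 0.3331, showing that the approach scales to a large and diverse dataset.
Qualitatively, the approach is robust to different languages, scripts, curved text, and challenging real-world conditions.
Inference takes about 0.42 s per 640x480 image on a K40m GPU, plus about 0.2 s of CPU post-processing.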
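
As a quick consistency check on the figures above, the F-measure can be recomputed from the reported precision and recall. The sketch below assumes the F-measure is the usual harmonic mean of precision and recall; the function name `f_measure` and the `reported` table are introduced here for illustration only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) as listed in the findings above.
reported = {
    "ICDAR 2013": (0.8888, 0.8022, 0.8433),
    "ICDAR 2015": (0.7226, 0.5869, 0.6477),
    "MSRA-TD500": (0.7651, 0.7531, 0.7591),
    "COCO-Text": (0.4323, 0.271, 0.3331),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed F = {f_measure(p, r):.4f}, reported F = {f}")
```

The recomputed values should agree with the reported ones to within 0.001; a difference in the last digit is expected because the inputs are themselves rounded.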

This review was generated by AI and checked by a human editor.