QUICK REVIEW

[Paper Review] Guided Attention for Large Scale Scene Text Verification

Dafang He, Yeqing Li|arXiv (Cornell University)|Apr 23, 2018

Handwritten Text Recognition Techniques | References: 22 | Cited by: 2

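One-line summary

This paper proposes Guided Attention, an end-to-end model that verifies whether a given text string is present in a scene image without requiring bounding box annotations or explicit text detection and recognition. On the large-scale, challenging Street View Business Matching task, it outperforms several state-of-the-art solutions based on scene text reading.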

ABSTRACT

Many tasks are related to determining if a particular text string exists in an image. In this work, we propose a new framework that learns this task in an end-to-end way. The framework takes an image and a text string as input and then outputs the probability of the text string being present in the image. This is the first end-to-end framework that learns such relationships between text and images in scene text area. The framework does not require explicit scene text detection or recognition and thus no bounding box annotations are needed for it. It is also the first work in scene text area that tackles such a weakly labeled problem. Based on this framework, we developed a model called Guided Attention. Our designed model achieves much better results than several state-of-the-art scene text reading based solutions for a challenging Street View Business Matching task. The task tries to find correct business names for storefront images and the dataset we collected for it is substantially larger, and more challenging than existing scene text dataset. This new real-world task provides a new perspective for studying scene text related problems. We also demonstrate the uniqueness of our task via a comparison between our problem and a typical Visual Question Answering problem.

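Motivation and objectives

Develop an end-to-end framework that verifies whether a text string is present in a scene image without relying on scene text detection or recognition.
Treat scene text verification as a weakly labeled problem by removing the need for bounding box annotations.
Build a large, challenging dataset for real-world scene text verification, specifically targeting Street View Business Matching.
Show how this verification task differs from a typical Visual Question Answering (VQA) problem.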

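Proposed method

The framework takes an image and a text string as input and directly outputs the probability that the text is present in the image.
A guided attention mechanism focuses on the image regions relevant to the input text, improving the alignment between textual and visual features.
The model is trained end-to-end with weak supervision: it needs only image-text pairs, not bounding box labels.
Avoiding explicit scene text detection and recognition reduces the dependence on costly annotation.
A new dataset of storefront images was collected to support the task; it is more challenging and more diverse than existing scene text datasets.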
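The input/output contract described above (image and text in, presence probability out, trained only on pair-level labels) can be illustrated with a minimal sketch. This is not the authors' architecture, which this review does not specify; it is a generic text-conditioned attention written in NumPy. The function names `verify_text` and `bce_loss`, the dot-product attention, the linear classifier head, and all shapes are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def verify_text(image_feats, text_vec, w_out, b_out):
    """Hypothetical scorer: probability that the text appears in the image.

    image_feats: (num_regions, dim) region features, e.g. a flattened
                 CNN feature map (assumed, not taken from the paper).
    text_vec:    (dim,) embedding of the query string (assumed encoder).
    w_out:       (2 * dim,) weights of an assumed linear classifier head.
    b_out:       scalar bias of that head.
    """
    dim = image_feats.shape[1]
    # Attention logits: scaled similarity between each region and the text.
    logits = image_feats @ text_vec / np.sqrt(dim)
    attn = softmax(logits)              # (num_regions,), sums to 1
    context = attn @ image_feats        # (dim,) text-guided image summary
    joint = np.concatenate([context, text_vec])
    prob = sigmoid(joint @ w_out + b_out)
    return prob, attn


def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy on a pair-level label (1 = text is present)."""
    p = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))


# Toy usage with random features; no bounding boxes are involved anywhere.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(49, 16))   # e.g. a 7x7 grid of regions
text_vec = rng.normal(size=16)
w_out = rng.normal(size=32) * 0.1
prob, attn = verify_text(image_feats, text_vec, w_out, 0.0)
loss = bce_loss(prob, label=1.0)
```

The only supervision signal in this sketch is the binary label passed to `bce_loss`, which mirrors the weakly labeled setting: the attention weights are never supervised directly and would have to be learned from pair-level labels alone.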

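Experimental results

Research questions

RQ1: Can the presence of text in a scene image be verified end-to-end, without bounding box annotations or explicit text detection?
RQ2: How does the proposed framework perform on a real-world business matching task compared with state-of-the-art solutions based on scene text reading?
RQ3: How do weak supervision and end-to-end training affect the accuracy of scene text verification?
RQ4: How does the proposed verification task differ from Visual Question Answering (VQA) in task formulation and requirements?

Key findings

The Guided Attention model outperformed several state-of-the-art solutions based on scene text reading on the challenging Street View Business Matching task.
Because the model does not depend on bounding box annotations, the results demonstrate that weakly supervised learning is effective for this task.
The dataset collected for this task is substantially larger and more challenging than existing scene text datasets.
Unlike Visual Question Answering, the task centers on exact text matching, which underscores the uniqueness of the verification task.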
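To make the Street View Business Matching setup concrete, the sketch below shows one way a verification score could be used to pick a business name for a storefront image: score every candidate name and keep the highest. The names `match_business`, `top1_accuracy`, and `score_fn` are hypothetical, and top-1 accuracy is used here only as an illustrative metric; this review does not state which metric the paper reports.

```python
from typing import Callable, Iterable, Sequence, Tuple


def match_business(
    image: object,
    candidates: Sequence[str],
    score_fn: Callable[[object, str], float],
) -> Tuple[str, float]:
    """Return the candidate name with the highest verification score.

    score_fn is a placeholder for any model that maps (image, text)
    to a presence probability.
    """
    if not candidates:
        raise ValueError("need at least one candidate name")
    scored = [(score_fn(image, name), name) for name in candidates]
    best_score, best_name = max(scored, key=lambda pair: pair[0])
    return best_name, best_score


def top1_accuracy(
    examples: Iterable[Tuple[object, Sequence[str], str]],
    score_fn: Callable[[object, str], float],
) -> float:
    """Fraction of (image, candidates, true_name) examples matched correctly."""
    hits = 0
    total = 0
    for image, candidates, true_name in examples:
        predicted, _ = match_business(image, candidates, score_fn)
        hits += int(predicted == true_name)
        total += 1
    return hits / total if total else 0.0


# Toy stand-in: each "image" is just the set of strings visible in it.
def toy_score(image, name):
    return 1.0 if name in image else 0.0


examples = [
    ({"Blue Cafe"}, ["Red Diner", "Blue Cafe"], "Blue Cafe"),
    ({"Red Diner"}, ["Red Diner", "Blue Cafe"], "Red Diner"),
]
accuracy = top1_accuracy(examples, toy_score)
```

Framing the task this way keeps the verifier itself unchanged: matching is simply repeated verification over a candidate list.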

This review was generated by AI and checked by a human editor.