[Paper Review] WeText: Scene Text Detection under Weak Supervision
WeText proposes a weakly supervised scene text detection framework that trains accurate character detectors using only 229 fully annotated images by leveraging large-scale unannotated or weakly annotated data. It uses a lightweight supervised model to mine positive character samples from weakly supervised data, integrates regression-based detection to reduce error accumulation, and achieves state-of-the-art performance with minimal human annotation.
The need for large amounts of annotated training data has become a common constraint on various deep learning systems. In this paper, we propose a weakly supervised scene text detection method (WeText) that trains robust and accurate scene text detection models by learning from unannotated or weakly annotated data. With a "light" supervised model trained on a small fully annotated dataset, we explore semi-supervised and weakly supervised learning on a large unannotated dataset and a large weakly annotated dataset, respectively. For the semi-supervised learning, the light supervised model is applied to the unannotated dataset to search for more character training samples, which are further combined with the small annotated dataset to retrain a superior character detection model. For the weakly supervised learning, the character searching is guided by high-level annotations of words/text lines that are widely available and also much easier to prepare. In addition, we design a unified scene character detector by adapting regression-based deep networks, which greatly relieves the error accumulation issue that widely exists in most traditional approaches. Extensive experiments across different unannotated and weakly annotated datasets show that the scene text detection performance can be clearly boosted under both scenarios, where the weakly supervised learning can achieve state-of-the-art performance by using only 229 fully annotated scene text images.
Motivation & Objective
- Address the high cost and scarcity of fully annotated scene text datasets in deep learning.
- Reduce error accumulation in character-based scene text detection by eliminating separate candidate generation and classification stages.
- Enable effective training of robust text detectors using weak supervision, such as word-level or text-line-level annotations, instead of expensive character-level annotations.
- Demonstrate that weakly supervised learning can achieve performance close to fully supervised models with minimal human-annotated data.
Proposed method
- Train a lightweight supervised model on a small set of fully annotated character images.
- Use the lightweight model to infer and mine positive character candidates from large unannotated or weakly annotated datasets.
- Apply semi-supervised learning by combining mined samples with the original annotated data for retraining.
- Implement weakly supervised learning by guiding character candidate mining using high-level word or text-line annotations, which are easier to collect.
- Design a proposal-free, regression-based deep network that directly predicts character bounding boxes and text confidence in a single forward pass.
- Integrate the detection and classification steps into one unified network to minimize error propagation and improve accuracy and efficiency.
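The two mining strategies listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` type, the function names, and the thresholds (`min_score`, `min_inside`) are hypothetical, and the review does not state the paper's actual selection criteria. The sketch only shows the general idea that semi-supervised mining relies on detector confidence alone, while weakly supervised mining can also require a candidate to fall inside an annotated word or text-line box.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Detection:
    box: Box
    score: float  # detector confidence in [0, 1]


def overlap_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    ix = max(0.0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    iy = max(0.0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (ix * iy) / area if area > 0 else 0.0


def mine_semi_supervised(dets: List[Detection],
                         min_score: float = 0.9) -> List[Detection]:
    """Keep only high-confidence candidates (hypothetical threshold)."""
    return [d for d in dets if d.score >= min_score]


def mine_weakly_supervised(dets: List[Detection],
                           word_boxes: List[Box],
                           min_score: float = 0.5,
                           min_inside: float = 0.8) -> List[Detection]:
    """Keep candidates that lie mostly inside an annotated word/text-line box.

    The word-level annotation acts as a second filter, so this sketch
    uses a lower (assumed) confidence threshold than the semi-supervised case.
    """
    kept = []
    for d in dets:
        if d.score < min_score:
            continue
        if any(overlap_fraction(d.box, w) >= min_inside for w in word_boxes):
            kept.append(d)
    return kept


# Illustrative, made-up candidates from the "light" model on one image.
dets = [
    Detection((10, 10, 20, 30), 0.95),      # confident, inside the word box
    Detection((22, 10, 32, 30), 0.60),      # less confident, inside the word box
    Detection((200, 200, 210, 220), 0.70),  # outside any annotated word
]
word_boxes = [(8, 8, 60, 32)]

semi = mine_semi_supervised(dets)
weak = mine_weakly_supervised(dets, word_boxes)
```

With these made-up inputs, the confidence-only filter keeps one candidate, while the word-guided filter keeps the two candidates inside the annotated word and rejects the one outside it. In either setting, the mined samples would then be merged with the small fully annotated set before retraining.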
Experimental results
Research questions
- RQ1: Can weakly supervised learning significantly improve scene text detection performance when only a small number of fully annotated images are available?
- RQ2: How effective is mining positive character samples from unannotated or weakly annotated data in boosting detector performance?
- RQ3: Can a unified regression-based detector outperform traditional two-stage character detection pipelines in terms of accuracy and error accumulation?
- RQ4: Does the performance of weakly supervised models scale with the size of the weakly annotated dataset?
- RQ5: To what extent can iterative self-training improve model performance in weakly supervised scene text detection?
Key findings
- The weakly supervised WeText model achieves state-of-the-art performance on ICDAR 2013 using only 229 fully annotated scene text images.
- The COCO-Text_Weakly_TL model outperforms both FORU_Semi_TL and FORU_Weakly_TL, demonstrating that larger weakly annotated datasets lead to better performance.
- On the SWT dataset, the proposed method improves F-score to 59.8% with weakly supervised learning, surpassing the baseline (53.9%) and other prior methods.
- Iterative self-training improves the weakly supervised model’s F-score from 85.4% to 86.2% after two rounds, approaching the performance of the fully supervised model (86.2% vs. 86.4%).
- The model runs at 0.32 seconds per image on a Titan X GPU (about 3 images per second), showing potential for real-time applications.
- Qualitative results show significant improvements in recall and reduced false positives, especially when training on larger weakly annotated datasets like COCO-Text.
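For readers unfamiliar with the metrics quoted above, the short Python sketch below shows how an F-score is commonly computed as the harmonic mean of precision and recall, and how a per-image latency converts to throughput. The precision and recall values are hypothetical, since the review does not report them, and benchmark protocols such as ICDAR's apply their own box-matching rules before this final step.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def images_per_second(seconds_per_image: float) -> float:
    """Convert per-image latency into throughput."""
    return 1.0 / seconds_per_image


# Hypothetical precision/recall, chosen only to illustrate the formula.
example_f = f_measure(precision=0.90, recall=0.80)

# Latency reported in the findings above: 0.32 s per image.
throughput = images_per_second(0.32)
```

Because the harmonic mean is pulled toward the lower of the two values, a gain in recall from additional mined samples raises the F-score only if precision does not drop by a comparable amount.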
This review was created by AI and reviewed by human editors.