[Paper Review] Reading Scene Text with Attention Convolutional Sequence Modeling
This paper proposes an end-to-end attention convolutional network for scene text recognition that uses stacked CNNs for sequence modeling (no RNNs) with residual attention, achieving competitive or state-of-the-art results on standard benchmarks in both lexicon-free and lexicon-based settings.
Reading text in the wild is a challenging task in the field of computer vision. Existing approaches mainly adopted Connectionist Temporal Classification (CTC) or Attention models based on Recurrent Neural Network (RNN), which is computationally expensive and hard to train. In this paper, we present an end-to-end Attention Convolutional Network for scene text recognition. Firstly, instead of RNN, we adopt the stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, which is characterized by lower computational complexity and easier parallel computation. Compared to the chain structure of recurrent networks, the Convolutional Neural Network (CNN) provides a natural way to capture long-term dependencies between elements, which is 9 times faster than Bidirectional Long Short-Term Memory (BLSTM). Furthermore, in order to enhance the representation of foreground text and suppress the background noise, we incorporate the residual attention modules into a small densely connected network to improve the discriminability of CNN features. We validate the performance of our approach on the standard benchmarks, including the Street View Text, IIIT5K and ICDAR datasets. As a result, state-of-the-art or highly-competitive performance and efficiency show the superiority of the proposed approach.
Motivation & Objective
- Motivate and address the efficiency and accuracy challenges of scene text recognition in unconstrained scenes.
- Propose a fully convolutional architecture to replace recurrent sequence modeling for faster and parallelizable processing.
- Incorporate residual attention within a densely connected encoder to suppress background noise and enhance foreground text features.
- Enable end-to-end training on word-level annotations without relying on pre-segmented characters or fixed dictionaries.
Proposed method
- Introduce an attention feature encoder based on dense blocks with residual attention to produce a robust feature sequence from a word image.
- Transform the feature sequence into a 2D map (sequence-to-map) and apply stacked convolutional layers to model contextual dependencies without recurrence.
- Restore the CNN output back to a sequence (map-to-sequence) and apply a linear layer to obtain per-frame label distributions.
- Use Connectionist Temporal Classification (CTC) to convert per-frame distributions into final word sequences, enabling lexicon-free and lexicon-based decoding.
- Train end-to-end with word-level annotations using a negative log-likelihood objective under CTC.
- Demonstrate efficiency gains (CNN-based sequence modeling is faster and uses fewer parameters than BLSTM) while maintaining competitive accuracy.
Experimental results
Research questions
- RQ1Can a convolutional sequence modeling approach (without RNNs) achieve competitive recognition accuracy for scene text while offering computational efficiency?
- RQ2Does incorporating residual attention into a densely connected encoder improve foreground text representation and suppress background noise in scene text images?
- RQ3Is end-to-end training with word-level annotations feasible and effective for both lexicon-free and lexicon-based scene text recognition?
- RQ4How does the proposed attention convolutional network perform compared to state-of-the-art methods on SVT, IIIT5K, and ICDAR benchmarks under different lexicon settings?
Key findings
| Methods | SVT-50 | SVT | IIIT5k-50 | IIIT5k-1k | IIIT5k | IC03-50 | IC03-Full | IC03 | IC13 |
|---|---|---|---|---|---|---|---|---|---|
| Ours | 97.4 | 82.7 | 99.1 | 97.9 | 81.8 | 98.7 | 96.7 | 89.2 | 88.0 |
- Achieves competitive to state-of-the-art results on SVT, IIIT5k, IC03, and IC13, with strong lexicon-free performance.
- Demonstrates that CNN-based sequence modeling runs about 9 times faster than BLSTM while requiring fewer parameters.
- Residual attention modules improve recognition accuracy, particularly on noisy datasets like SVT and IIIT5k.
- Outperforms several prior methods in lexicon-based settings, notably on IIIT5k with 1000-word lexicon.
- The model is robust to spatial distortions and does not rely on explicit text rectification components.
Better researchstarts right now
From paper design to paper writing, dramatically reduce your research time.
No credit card · Free plan available
This review was created by AI and reviewed by human editors.