QUICK REVIEW

[Paper Review] TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

Minghao Li, Tengchao Lv | arXiv (Cornell University) | 2021-09-21

Handwritten Text Recognition Techniques | References: 41 | Citations: 77

One-line summary

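TrOCR is an end-to-end Transformer-based OCR model that leverages pre-trained image and text Transformers, achieving state-of-the-art results on printed, handwritten, and scene text without a CNN backbone or an external language model.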

ABSTRACT

Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.

Research motivation and objectives

Motivate end-to-end OCR that leverages pre-trained visual and language models.
Propose a CNN-free Transformer architecture for image-to-text transcription.
Show that pre-training on large-scale synthetic data improves downstream OCR tasks.
Demonstrate state-of-the-art performance on printed, handwritten, and scene text benchmarks.

Proposed method

Use a pre-trained image Transformer as the encoder: the input image is resized to 384x384 and split into a sequence of 16x16 patches.
Use a pre-trained text Transformer as decoder to generate wordpiece tokens with encoder–decoder attention.
Initialize encoder with DeiT/BEiT pre-trained models and decoder with RoBERTa/MiniLM variants.
Train with a two-stage pre-training regime on large-scale synthetic data followed by fine-tuning on downstream tasks.
Tokenize output with BPE and SentencePiece without reliance on task-specific vocabularies.
Infer with beam search (beam size 10) to produce final wordpiece sequences.
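
The encoder input size and the beam-search decoding step above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the authors' implementation: `num_patches`, `beam_search`, `toy_decoder`, and `TOY_TABLE` are hypothetical names, the 16x16 patch size on a 384x384 input is an assumption of the sketch, and the lookup table stands in for a real Transformer decoder.

```python
import math


def num_patches(height=384, width=384, patch=16):
    """Count the non-overlapping patches for one resized input image."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)


# Hypothetical next-token distributions, keyed by the tokens generated so far.
TOY_TABLE = {
    (): {"he": math.log(0.6), "the": math.log(0.4)},
    ("he",): {"llo": math.log(0.55), "lp": math.log(0.45)},
    ("the",): {"re": math.log(0.9), "</s>": math.log(0.1)},
}


def toy_decoder(prefix):
    """Stand-in for the text decoder: prefix tuple -> {token: log-prob}."""
    return TOY_TABLE.get(prefix, {"</s>": 0.0})


def beam_search(step_log_probs, beam_size=10, eos="</s>", max_len=8):
    """Return the (tokens, log-prob) pair with the highest cumulative score."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the best `beam_size` hypotheses at this step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])


print(num_patches())                           # patches per image
print(beam_search(toy_decoder, beam_size=10))  # wide beam
print(beam_search(toy_decoder, beam_size=1))   # greedy decoding
```

In this toy table the greedy search (beam size 1) commits to "he" first and ends with total probability 0.33, while the wider beam keeps "the" alive and finds the higher-probability sequence (0.36). Production decoders typically add length normalization and batching, which are omitted here.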

Experimental results

Research questions

RQ1: Can a CNN-free Transformer encoder–decoder architecture achieve competitive OCR accuracy across printed, handwritten, and scene text?
RQ2: What is the impact of pre-trained image and text transformers on OCR performance compared to CNN/RNN baselines?
RQ3: How does two-stage pre-training on synthetic data influence downstream OCR benchmarks?
RQ4: Is external language modeling necessary when using pre-trained Transformer decoders for OCR?

Key results

TrOCR with BEiT encoder and RoBERTa-LARGE decoder achieves strong results across benchmarks, outperforming CNN/RNN baselines.
On SROIE, TrOCR-LARGE attains an F1 score of 96.58 (table shows 96.59 precision, 96.57 recall).
On IAM handwriting, TrOCR-LARGE achieves a CER of 2.89, surpassing several CNN/RNN-based methods.
On scene text benchmarks, TrOCR models establish five new state-of-the-art results across eight experiments when fine-tuned with synthetic data and benchmark data.
Compared variants show that pre-trained image transformers (BEiT) and large decoders (RoBERTa-LARGE) yield the best performance.
Inference speed shows that TrOCR-SMALL offers a favorable accuracy-speed trade-off with significantly fewer parameters.
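
The two metrics quoted above can be made concrete with a short sketch. It assumes the usual definitions: character error rate (CER) as Levenshtein edit distance divided by reference length, and F1 as the harmonic mean of precision and recall. The function names are hypothetical, and the benchmarks' own evaluation scripts may differ in details such as case handling or tokenization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# One extra character in an 11-character reference.
print(round(cer("handwriting", "handwritting"), 2))
# The reported SROIE precision and recall, combined into F1.
print(round(f1(96.59, 96.57), 2))
```

Because F1 is symmetric in its two arguments, the result is the same whichever of the two reported values is precision and which is recall.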

This review was generated by AI and reviewed by a human editor.