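[Paper Review] TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
TrOCR is an end-to-end Transformer-based OCR model that uses pre-trained image and text Transformers to achieve state-of-the-art results on printed, handwritten, and scene text, without a CNN backbone or an external language model.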
Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
Research Motivation and Objectives
- Motivate end-to-end OCR that leverages pre-trained visual and language models.
- Propose a CNN-free Transformer architecture for image-to-text transcription.
- Show that pre-training on large-scale synthetic data improves downstream OCR tasks.
- Demonstrate state-of-the-art performance on printed, handwritten, and scene text benchmarks.
Proposed Method
- Use a pre-trained image Transformer as the encoder: the input image is resized to 384x384 and split into a sequence of 16x16 patches.
- Use a pre-trained text Transformer as decoder to generate wordpiece tokens with encoder–decoder attention.
- Initialize encoder with DeiT/BEiT pre-trained models and decoder with RoBERTa/MiniLM variants.
- Train with a two-stage pre-training regime on large-scale synthetic data followed by fine-tuning on downstream tasks.
- Tokenize output with BPE and SentencePiece without reliance on task-specific vocabularies.
- Infer with beam search (beam size 10) to produce final wordpiece sequences.
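The encoder input and the decoding step above can be illustrated with a minimal Python sketch. It assumes a 384x384 input split into non-overlapping 16x16 patches, as in ViT-style encoders such as DeiT and BEiT. The names `num_patches`, `beam_search`, and `toy_model` are hypothetical and are not part of the TrOCR code base. The toy scoring table stands in for the real decoder, and the search omits refinements such as length normalization that production decoders often apply.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def num_patches(image_size: int = 384, patch_size: int = 16) -> int:
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[str, float]],
    beam_size: int = 10,
    max_len: int = 20,
    eos: str = "</s>",
) -> Sequence:
    """Return the highest-scoring token sequence found by beam search."""
    beams: List[Tuple[float, Sequence]] = [(0.0, ())]
    finished: List[Tuple[float, Sequence]] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every candidate next token.
        candidates: List[Tuple[float, Sequence]] = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + (token,)))
        if not candidates:
            break
        # Keep the top beam_size candidates; set aside those that ended.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical stand-in for the decoder: log-probs of the next token."""
    table = {
        (): {"Tr": math.log(0.6), "T": math.log(0.4)},
        ("Tr",): {"OCR": math.log(0.5), "</s>": math.log(0.5)},
        ("T",): {"rOCR": math.log(0.9), "</s>": math.log(0.1)},
    }
    # Any prefix not in the table simply ends the sequence.
    return table.get(prefix, {"</s>": 0.0})


print(num_patches())                        # 576 patches (24 per side)
print(beam_search(toy_model, beam_size=10))
print(beam_search(toy_model, beam_size=1))  # beam size 1 is greedy decoding
```

In this toy table the greedy search commits to the locally best first token and ends with total probability 0.30, while the wider beam keeps the alternative first token alive and finds a sequence with probability 0.36. This is the general motivation for decoding with a beam larger than one.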
Experimental Results
Research Questions
- RQ1: Can a CNN-free Transformer encoder–decoder architecture achieve competitive OCR accuracy across printed, handwritten, and scene text?
- RQ2: What is the impact of pre-trained image and text transformers on OCR performance compared to CNN/RNN baselines?
- RQ3: How does two-stage pre-training on synthetic data influence downstream OCR benchmarks?
- RQ4: Is external language modeling necessary when using pre-trained Transformer decoders for OCR?
Key Results
- TrOCR with BEiT encoder and RoBERTa-LARGE decoder achieves strong results across benchmarks, outperforming CNN/RNN baselines.
- On SROIE, TrOCR-LARGE attains an F1 of 96.58 (precision 96.59, recall 96.57).
- On IAM handwriting, TrOCR-LARGE achieves a CER of 2.89, surpassing several CNN/RNN-based methods.
- On scene text benchmarks, TrOCR models establish five new state-of-the-art results across eight experiments when fine-tuned with synthetic data and benchmark data.
- A comparison of encoder and decoder variants shows that a pre-trained image Transformer encoder (BEiT) paired with a large decoder (RoBERTa-LARGE) yields the best performance.
- Inference-speed measurements show that TrOCR-SMALL offers a favorable accuracy-speed trade-off with significantly fewer parameters.
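The two metrics quoted above can be reproduced with a short Python sketch. It assumes the standard definitions: CER as character-level edit distance divided by reference length, and F1 as the harmonic mean of precision and recall. The function names are hypothetical, and the benchmarks' official scoring scripts may differ in details such as tokenization or case handling.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(96.59, 96.57), 2))  # 96.58, matching the SROIE figures
print(cer("kitten", "sitting"))          # 50.0 (3 edits over 6 characters)
```

The first line confirms that the reported SROIE precision and recall are consistent with the reported F1.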
This review was generated by AI and reviewed by a human editor.