QUICK REVIEW

[Paper Review] TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

Yuliang Liu, Biao Yang|arXiv (Cornell University)|Mar 7, 2024
Natural Language Processing Techniques|12 citations
TL;DR

TextMonkey is an OCR-free large multimodal model for text-centric document understanding that uses Shifted Window Attention, a token resampler, and text grounding to improve high-resolution visual-text reasoning, achieving strong gains across scene text, documents, and OCR benchmarks.

ABSTRACT

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancement across several dimensions: By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; We hypothesize that images may contain redundant tokens, and by using similarity to filter out significant tokens, we can not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability. It also learns to perform screenshot tasks through finetuning. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). It outperforms in scene text spotting with a 10.9% increase and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-sourced large multimodal models for document understanding. Code will be released at https://github.com/Yuliang-Liu/Monkey.

Motivation & Objective

  • Motivate OCR-free approaches to document understanding to avoid OCR errors and external pipelines.
  • Develop a high-resolution, cross-window multimodal encoder capable of handling dense text in documents and scenes.
  • Introduce a token resampling strategy to reduce token redundancy without losing crucial information.
  • Enable text spotting and text grounding to improve interpretability and reduce hallucinations in LLM-based answers.
  • Demonstrate strong empirical gains across a broad suite of benchmarks including OCRBench.

Proposed method

  • Divide high-resolution images into non-overlapping 448x448 windows using a sliding window module.
  • Within each window, apply transformer blocks from CLIP; use Shifted Window Attention with zero initialization to enable cross-window connectivity.
  • Use an Image Resampler with 256 learnable queries to compress visual features to a fixed length (256) and preserve 2D positional encoding.
  • Introduce a Token Resampler that scores each token as 1 minus its maximum similarity to any other token, keeps the highest-scoring (least redundant) tokens to shorten the sequence, then uses cross-attention to re-aggregate features.
  • Jointly process image features with a Large Language Model (7.7B) to produce answers, enabling OCR-free end-to-end reasoning across tasks.
  • Incorporate position-aware tasks (text spotting, reading text, VQA grounding) and structured data fine-tuning to improve alignment between text and location information.
  • Train on a diversified, publicly available dataset mix for scene text and document understanding, followed by a structured-data fine-tuning stage to form TextMonkey†.
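
The windowing and token-selection steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, cosine similarity is an assumed choice of similarity measure, and the token count assumes the 256 resampler queries apply to each 448x448 window. The cross-attention re-aggregation step is omitted.

```python
import math


def window_token_count(height, width, window=448, tokens_per_window=256):
    """Visual tokens produced when an image is cut into non-overlapping windows.

    Assumption: each window is compressed to `tokens_per_window` tokens and
    the image size is an exact multiple of the window size.
    """
    if height % window or width % window:
        raise ValueError("image size must be a multiple of the window size")
    return (height // window) * (width // window) * tokens_per_window


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def token_importance(tokens):
    """Score each token as 1 - (max similarity to any other token).

    A token that closely duplicates another scores near 0; a token unlike
    all others scores high.
    """
    scores = []
    for i, token in enumerate(tokens):
        max_sim = max(
            (cosine_similarity(token, other)
             for j, other in enumerate(tokens) if j != i),
            default=0.0,
        )
        scores.append(1.0 - max_sim)
    return scores


def select_important_tokens(tokens, keep):
    """Indices of the `keep` highest-scoring tokens, returned in original order."""
    scores = token_importance(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: tokens 0 and 1 are exact duplicates, tokens 2 and 3 are distinct.
toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
kept = select_important_tokens(toy_tokens, keep=2)
```

In the toy example the two duplicated tokens receive a score of 0 and are dropped first, which is the redundancy-filtering behaviour the bullet describes. Under the stated per-window assumption, an 896x1344 image yields 6 windows and 1536 tokens before any filtering, which suggests why reducing token length matters at high resolution.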

Experimental results

Research questions

  • RQ1: How can OCR-free large multimodal models handle high-resolution document images with dense text without relying on external OCR tools?
  • RQ2: Can cross-window connectivity and token compression improve recognition and grounding of text across scenes and documents?
  • RQ3: Does integrating text spotting and text grounding improve interpretability and reduce hallucinations in LLM-based responses?
  • RQ4: What are the gains of OCR-free approaches across scene-text, document-oriented, and KIE benchmarks compared to prior open-source LMMs?

Key findings

  • TextMonkey achieves a 5.2% improvement in Scene Text-Centric VQA tasks (STVQA, TextVQA, OCRVQA).
  • TextMonkey achieves a 6.9% improvement in Document-Oriented VQA tasks (DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, WikiTableQuestions).
  • TextMonkey achieves a 2.8% improvement in Key Information Extraction tasks (FUNSD, SROIE, POIE).
  • TextMonkey shows a 10.9% gain in scene text spotting accuracy across Total-Text, CTW1500, and ICDAR 2015.
  • TextMonkey sets a new OCRBench score of 561 (29 OCR-related evaluations), surpassing previous open-source LMMs for document understanding.
  • TextMonkey† further improves: 61.2% (STVQA/DocVQA/ChartQA/InfoVQA) and 72.2% in the combined OCRBench-like evaluation metrics for some configurations.


This review was created by AI and reviewed by human editors.