QUICK REVIEW

[Paper Review] TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

Yuliang Liu, Biao Yang|arXiv (Cornell University)|Mar 7, 2024

Natural Language Processing Techniques | Cited by 12

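One-Sentence Summary

TextMonkey is an OCR-free large multimodal model that uses Shifted Window Attention, token resampling, and text grounding to improve high-resolution visual-text reasoning, achieving strong gains on scene text, document, and OCR benchmarks.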

ABSTRACT

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancement across several dimensions: By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; We hypothesize that images may contain redundant tokens, and by using similarity to filter out significant tokens, we can not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability. It also learns to perform screenshot tasks through finetuning. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). It outperforms in scene text spotting with a 10.9% increase and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-sourced large multimodal models for document understanding. Code will be released at https://github.com/Yuliang-Liu/Monkey.

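Motivation and Objectives

Motivate an OCR-free approach to document understanding that avoids OCR errors and external pipelines.
Develop a high-resolution, cross-window multimodal encoder that can handle dense text in documents and scenes.
Introduce a token resampling strategy that reduces token redundancy without losing important information.
Enable text spotting and text grounding to improve interpretability and reduce hallucination in LLM-based answers.
Demonstrate strong empirical gains on a broad benchmark suite that includes OCRBench.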

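Proposed Method

Split the high-resolution image into non-overlapping 448x448 windows using a sliding window module.
Apply CLIP transformer blocks within each window, using zero-initialized Shifted Window Attention to enable cross-window connections.
Compress the visual features to a fixed length (256) with an Image Resampler that has 256 learnable queries, while retaining 2D positional encoding.
Introduce a token resampler that shortens the token sequence by selecting important tokens with a similarity-based criterion (1 - maximum token similarity) and then re-aggregates features through cross-attention.
Process the image features jointly with a Large Language Model (7.7B) to generate answers, enabling OCR-free end-to-end reasoning across tasks.
Combine position-aware tasks (text spotting, reading, VQA grounding) with structured-data fine-tuning to improve alignment between text and location information.
Train on a mix of diverse, publicly available scene text and document understanding datasets, then apply a structured-data fine-tuning stage to obtain TextMonkey†.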
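
The similarity-based selection criterion (1 - maximum token similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cosine similarity as the similarity measure, excludes each token's similarity to itself, and keeps a fixed number of tokens. The function names `cosine_similarity`, `token_importance`, and `select_important_tokens` and the `keep` parameter are hypothetical. The cross-attention re-aggregation step that follows selection is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def token_importance(tokens):
    """Score each token as 1 - (max cosine similarity to any OTHER token).

    Assumption: a token that closely matches another token is redundant,
    so it receives a score near 0; a distinctive token scores higher.
    """
    scores = []
    for i, token in enumerate(tokens):
        max_sim = max(
            (cosine_similarity(token, other)
             for j, other in enumerate(tokens) if j != i),
            default=0.0,  # a single token has no neighbours to compare with
        )
        scores.append(1.0 - max_sim)
    return scores


def select_important_tokens(tokens, keep):
    """Return indices of the `keep` highest-scoring tokens, in original order."""
    scores = token_importance(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    toy_tokens = [
        [1.0, 0.0],  # token 0
        [1.0, 0.0],  # token 1: exact duplicate of token 0
        [0.0, 1.0],  # token 2: orthogonal to tokens 0 and 1
        [1.0, 1.0],  # token 3: partially similar to all others
    ]
    print(select_important_tokens(toy_tokens, keep=2))  # prints [2, 3]
```

In this toy input, tokens 0 and 1 are duplicates, so each has a maximum similarity of 1 and an importance of 0; they are dropped first. The pairwise comparison is quadratic in the number of tokens, which is acceptable for a sketch but would normally be done as a batched matrix product on real feature maps.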

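Experimental Results

Research Questions

RQ1: How can an OCR-free large multimodal model process high-resolution, text-dense document images without relying on external OCR tools?
RQ2: Can cross-window connections and token compression improve text recognition and grounding across scenes and documents?
RQ3: Can integrating text spotting and text grounding improve interpretability and reduce hallucination in LLM-based answers?
RQ4: How large are the benefits of the OCR-free approach on scene text, document-oriented, and KIE benchmarks compared with previous open-source LMMs?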

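Key Findings

TextMonkey achieves a 5.2% improvement on Scene Text-Centric VQA tasks (STVQA, TextVQA, OCRVQA).
TextMonkey achieves a 6.9% improvement on Document-Oriented VQA tasks (DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, WikiTableQuestions).
TextMonkey achieves a 2.8% improvement on Key Information Extraction tasks (FUNSD, SROIE, POIE).
TextMonkey shows a 10.9% increase in scene text spotting accuracy on Total-Text, CTW1500, and ICDAR 2015.
TextMonkey sets a new OCRBench score of 561 (OCRBench comprises 29 OCR-related assessments), surpassing previous open-source LMMs for document understanding.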
TextMonkey† improves further: 61.2% on STVQA/DocVQA/ChartQA/InfoVQA, and 72.2% on an OCRBench-like aggregate metric in some configurations.


This review was generated by AI and checked by a human editor.