QUICK REVIEW

[Paper Review] mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Jiabo Ye, Anwen Hu | arXiv (Cornell University) | Jul 4, 2023

Natural Language Processing Techniques | Citations: 17

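One-line summary

mPLUG-DocOwl extends mPLUG-Owl with modular instruction tuning for OCR-free document understanding: it aligns a visual abstractor with a frozen LLM and achieves state-of-the-art results on several document datasets without task-specific fine-tuning.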

ABSTRACT

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multi-modal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Moreover, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.

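Motivation and objectives

Improve OCR-free document understanding by incorporating document-specific instruction tuning into a modular MLLM framework.
Balance language-only, general vision-and-language, and document understanding abilities through unified instruction tuning.
Achieve strong zero-shot and in-domain performance without extensive fine-tuning on each downstream task.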

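Proposed method

Use a modular architecture based on mPLUG-Owl, with a visual abstractor and a frozen language model.
Fine-tune the visual abstractor and the LoRA parameters while keeping the vision encoder and the LLM frozen.
Build an instruction tuning corpus that covers document, table, chart, and natural image tasks in a unified prompt format.
In the second training stage, add upsampled language-only and general vision-and-language instruction data.
Evaluate on LLMDoc, an OCR-free document understanding test set with human annotations.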
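The unified prompt format and the second-stage data mix described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the review: the `Sample` class, the `to_unified_prompt` and `build_stage_two_mix` helpers, the `<image>` / `Human:` / `AI:` template text, the upsampling factor of 3, and the toy samples. It is not the authors' data pipeline.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One instruction-tuning example (hypothetical structure)."""
    question: str
    answer: str
    has_image: bool = True  # language-only samples carry no image


def to_unified_prompt(sample: Sample) -> str:
    """Render any sample with one shared template (template text is assumed)."""
    image_slot = "<image>\n" if sample.has_image else ""
    return f"{image_slot}Human: {sample.question}\nAI: {sample.answer}"


def build_stage_two_mix(
    document: Sequence[Sample],
    general_vl: Sequence[Sample],
    language_only: Sequence[Sample],
    upsample: int = 3,  # assumed factor, chosen only for this sketch
    seed: int = 0,
) -> List[Sample]:
    """Repeat the general data `upsample` times, add document data, shuffle."""
    general = list(general_vl) + list(language_only)
    mix = list(document) + general * upsample
    random.Random(seed).shuffle(mix)  # seeded so the order is reproducible
    return mix


# Toy samples, invented for this example.
documents = [Sample("What is the invoice total?", "1,200 USD")]
general_vl = [Sample("What animal is shown?", "A dog")]
language_only = [Sample("Define OCR.", "Optical character recognition", has_image=False)]

mix = build_stage_two_mix(documents, general_vl, language_only, upsample=3)
for sample in mix:
    print(to_unified_prompt(sample))
    print("---")
```

Rendering every source through the same template is what lets a single model train on all three data types at once. Repeating the smaller general-purpose sets keeps them from being drowned out by the document data.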

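Experimental results

Research questions

RQ1: Can unified instruction tuning improve OCR-free document understanding across document types (documents, tables, charts, web pages) without heavy task-specific fine-tuning?
RQ2: How well does mPLUG-DocOwl balance OCR-free document understanding with general unimodal and multimodal abilities?
RQ3: What are the limitations of OCR-free document understanding with respect to commonsense reasoning, calculation, and creative generation?
RQ4: How well does mPLUG-DocOwl perform against existing MLLMs on LLMDoc, a carefully constructed document instruction dataset with human evaluation?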

Key findings

Model	DocVQA	InfoVQA	DeepForm	KLC	WTQ	TabFact
Dessurt	63.2	-	-	-	-	-
Donut	67.5	11.6	61.6	30.0	18.8	54.6
Pix2Struct base	72.1	38.2	-	-	-	-
mPLUG-DocOwl	62.2	38.2	42.6	30.3	26.9	60.2

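mPLUG-DocOwl achieves state-of-the-art or competitive OCR-free performance on several document understanding benchmarks without per-task fine-tuning: among the models in the table above, it has the highest listed score on KLC, WTQ, and TabFact, ties Pix2Struct base on InfoVQA, and trails on DocVQA and DeepForm.
Including language-only and general vision-and-language instruction tuning data improves generalization to downstream tasks.
On the LLMDoc evaluation, it shows substantially stronger visual-text understanding across document domains than existing MLLMs.
Human evaluation indicates that document-related commonsense reasoning, calculation, and creative generation remain challenging, leaving room for improvement.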
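For readers who want to check the benchmark table themselves, the short sketch below finds the best reported score per column. The numbers are copied from the table above; the `leaders` helper and the assumption that a higher score is better in every column are mine, since the review does not name the metrics.

```python
# Scores copied from the table above; None marks a "-" (not reported) cell.
BENCHMARKS = ["DocVQA", "InfoVQA", "DeepForm", "KLC", "WTQ", "TabFact"]
SCORES = {
    "Dessurt": [63.2, None, None, None, None, None],
    "Donut": [67.5, 11.6, 61.6, 30.0, 18.8, 54.6],
    "Pix2Struct base": [72.1, 38.2, None, None, None, None],
    "mPLUG-DocOwl": [62.2, 38.2, 42.6, 30.3, 26.9, 60.2],
}


def leaders(scores, benchmarks):
    """Return {benchmark: (best_score, [models with that score])}.

    Assumes higher is better; models with no reported score are skipped.
    """
    result = {}
    for i, bench in enumerate(benchmarks):
        reported = {m: row[i] for m, row in scores.items() if row[i] is not None}
        best = max(reported.values())
        tied = sorted(m for m, value in reported.items() if value == best)
        result[bench] = (best, tied)
    return result


for bench, (best, models) in leaders(SCORES, BENCHMARKS).items():
    print(f"{bench}: {best} ({', '.join(models)})")
```

Ties are kept as a list rather than broken arbitrarily, and unreported cells are excluded rather than treated as zero, so a model is never credited or penalized for a benchmark it was not evaluated on.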

This review was generated by AI and checked by a human editor.