QUICK REVIEW

[Paper Review] mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Jiabo Ye, Anwen Hu|arXiv (Cornell University)|Jul 4, 2023

Natural Language Processing Techniques | Cited by 17

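One-Sentence Summary

In brief: mPLUG-DocOwl extends mPLUG-Owl by combining unified instruction tuning with a visual abstractor aligned to a frozen language model. It achieves strong OCR-free document understanding without task-specific fine-tuning and reaches state-of-the-art results on several document datasets.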

ABSTRACT

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. Moreover, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.

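Motivation and Objectives

Improve OCR-free document understanding by integrating document-oriented instruction tuning into a modular multimodal LLM framework.
Balance language-only, general vision-language, and document understanding capabilities through unified instruction tuning.
Achieve strong zero-shot and in-domain performance without extensive fine-tuning on each downstream task.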

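Proposed Method

Adopt the modular mPLUG-Owl architecture, combining a visual encoder, a visual abstractor, and a frozen language model.
Fine-tune the visual abstractor and the LoRA parameters while keeping the visual encoder and the LLM frozen.
Build an instruction tuning corpus covering document, table, chart, and natural image tasks in a unified prompt format.
Add language-only and general vision-language instruction data in the second training stage, with upsampling.
Evaluate on LLMDoc, a human-annotated OCR-free document understanding test set.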
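
The unified prompt format and the upsampled data mixture described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the template strings, the `has_image` flag, the function names, the toy samples, and the upsampling factors are all hypothetical, and which data sources are upsampled is an assumption made only for the example.

```python
import random

# Hypothetical templates: the exact prompt wording used by the authors is not given here.
IMAGE_TEMPLATE = "<image>Human: {instruction} AI: {answer}"
TEXT_TEMPLATE = "Human: {instruction} AI: {answer}"


def to_unified(sample):
    """Render one task sample as a single instruction-following string."""
    template = IMAGE_TEMPLATE if sample.get("has_image", True) else TEXT_TEMPLATE
    return template.format(instruction=sample["instruction"], answer=sample["answer"])


def build_mixture(sources, upsample, seed=0):
    """Merge all sources, repeating each one `upsample[name]` times, then shuffle."""
    mixed = []
    for name, samples in sources.items():
        factor = upsample.get(name, 1)  # sources without a factor are used once
        for _ in range(factor):
            mixed.extend(to_unified(s) for s in samples)
    random.Random(seed).shuffle(mixed)  # seeded so the mixture is reproducible
    return mixed


# Toy data, invented for illustration only.
sources = {
    "document": [{"instruction": "What is the invoice total?", "answer": "$120"}],
    "language_only": [
        {"instruction": "Name a prime number.", "answer": "7", "has_image": False}
    ],
}
mixture = build_mixture(sources, {"language_only": 3})  # assumed factor
```

Rendering every task through one template lets a single model be trained on all sources at once, and the per-source repeat factor is a simple way to control how much each source contributes to the mixture.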

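Experimental Results

Research Questions

RQ1: Can unified instruction tuning improve OCR-free document understanding across multiple document types (documents, tables, charts, web pages) without extensive task-specific fine-tuning?
RQ2: How well does mPLUG-DocOwl balance OCR-free document understanding against general unimodal and multimodal capabilities?
RQ3: What are the limitations of OCR-free document understanding in commonsense reasoning, calculation, and creative generation?
RQ4: How does mPLUG-DocOwl perform relative to existing MLLMs on LLMDoc, a carefully constructed and human-evaluated document instruction dataset?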

Key Findings

Model	DocVQA	InfoVQA	DeepForm	KLC	WTQ	TabFact
Dessurt	63.2	-	-	-	-	-
Donut	67.5	11.6	61.6	30.0	18.8	54.6
Pix2Struct base	72.1	38.2	-	-	-	-
mPLUG-DocOwl	62.2	38.2	42.6	30.3	26.9	60.2
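
To make the comparison in the table above easier to read, the sketch below computes the margin between mPLUG-DocOwl and the best other model on each benchmark. The scores are copied from the table; the function and variable names are hypothetical, and the comparison covers only the models listed here.

```python
# Scores copied from the table above; None marks a result that was not reported ("-").
BENCHMARKS = ["DocVQA", "InfoVQA", "DeepForm", "KLC", "WTQ", "TabFact"]
SCORES = {
    "Dessurt": [63.2, None, None, None, None, None],
    "Donut": [67.5, 11.6, 61.6, 30.0, 18.8, 54.6],
    "Pix2Struct base": [72.1, 38.2, None, None, None, None],
    "mPLUG-DocOwl": [62.2, 38.2, 42.6, 30.3, 26.9, 60.2],
}


def compare_to_best_baseline(target="mPLUG-DocOwl"):
    """Return {benchmark: (best_baseline, margin)}, with margin = target - best baseline."""
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        # Keep only the other models that actually report this benchmark.
        baselines = {
            model: scores[i]
            for model, scores in SCORES.items()
            if model != target and scores[i] is not None
        }
        best = max(baselines, key=baselines.get)
        result[bench] = (best, round(SCORES[target][i] - baselines[best], 1))
    return result


for bench, (best, margin) in compare_to_best_baseline().items():
    print(f"{bench}: {margin:+.1f} vs {best}")
```

On these numbers, mPLUG-DocOwl leads on KLC, WTQ, and TabFact, ties the best listed model on InfoVQA, and trails on DocVQA and DeepForm.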

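mPLUG-DocOwl achieves state-of-the-art or competitive OCR-free performance on multiple document understanding benchmarks without per-task fine-tuning.
Because its training data includes language-only and general vision-language instruction tuning data, the model generalizes well to downstream tasks.
On the LLMDoc evaluation, mPLUG-DocOwl clearly outperforms existing MLLMs in visual-text understanding across document domains.
Human evaluation shows that document-related commonsense reasoning, calculation, and creative generation remain challenging, pointing to directions for improvement.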

This review was generated by AI and reviewed by human editors.