QUICK REVIEW

[Paper Review] mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Jiabo Ye, Anwen Hu|arXiv (Cornell University)|2023. 07. 04.

Natural Language Processing Techniques | Citations: 17

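One-Line Summary

mPLUG-DocOwl extends mPLUG-Owl to OCR-free document understanding through unified instruction tuning that aligns a visual abstractor with a frozen LLM, and it achieves state-of-the-art results on several document datasets without task-specific fine-tuning.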

ABSTRACT

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as a web page. Existing Multi-modal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl based on mPLUG-Owl for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.

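Motivation and Objectives

Improve OCR-free document understanding by integrating document-specific instruction tuning into a modular MLLM framework.
Balance language-only, general vision-and-language, and document understanding abilities through unified instruction tuning.
Enable strong zero-shot and in-domain performance without extensive fine-tuning on each downstream task.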

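Proposed Method

Uses the modular mPLUG-Owl architecture, which pairs a visual abstractor with a frozen language model.
Fine-tunes the visual abstractor and the LoRA parameters while keeping the visual encoder and the LLM frozen.
Builds an instruction tuning corpus in a unified prompt format that covers document, table, chart, and natural image tasks.
Adds language-only and general vision-and-language instruction data, with upsampling, in the second training stage.
Evaluates on LLMDoc, a human-annotated OCR-free document understanding test set.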
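
The training recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the module labels in `TRAINABLE`, the prompt template in `to_unified_prompt`, the `has_image` field, and the default `upsample` factor are all assumptions made for the example, and the review does not specify any of them.

```python
import random

# Which modules receive gradient updates, following the list above.
# The key names are hypothetical labels, not identifiers from the released code.
TRAINABLE = {
    "visual_encoder": False,     # kept frozen
    "visual_abstractor": True,   # fine-tuned
    "llm": False,                # kept frozen
    "lora": True,                # fine-tuned (low-rank adapters)
}


def trainable_modules():
    """Return the names of the modules that are updated during tuning."""
    return sorted(name for name, enabled in TRAINABLE.items() if enabled)


def to_unified_prompt(sample):
    """Render a question/answer pair in one shared prompt format.

    The template is an assumed placeholder. Samples are treated as having
    an image unless they set has_image=False (language-only data).
    """
    image_tag = "<image>" if sample.get("has_image", True) else ""
    return f"{image_tag}Human: {sample['question']} AI: {sample['answer']}"


def build_stage2_mix(doc_data, language_data, general_vl_data, upsample=2, seed=0):
    """Mix document data with upsampled language-only and general VL data.

    `upsample` repeats the auxiliary data that many times; the default is
    an arbitrary illustrative value, not a number taken from the paper.
    """
    auxiliary = list(language_data) + list(general_vl_data)
    samples = list(doc_data) + auxiliary * upsample
    random.Random(seed).shuffle(samples)  # deterministic shuffle for the demo
    return [to_unified_prompt(sample) for sample in samples]
```

Repeating the auxiliary data is one simple way to keep general language and vision-and-language skills represented when the document data dominates the mix; weighted sampling would be an equally valid choice.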

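Experimental Results

Research Questions

RQ1: Can unified instruction tuning improve OCR-free document understanding across diverse document types (documents, tables, charts, webpages) without heavy task-specific fine-tuning?
RQ2: How well does mPLUG-DocOwl balance OCR-free document understanding with general unimodal and multimodal abilities?
RQ3: What are the limits of OCR-free document understanding in commonsense reasoning, calculation, and creative generation?
RQ4: How does performance on the human-evaluated document instruction dataset (LLMDoc) compare with that of existing MLLMs?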

Key Results

Model	DocVQA	InfoVQA	DeepForm	KLC	WTQ	TabFact
Dessurt	63.2	-	-	-	-	-
Donut	67.5	11.6	61.6	30.0	18.8	54.6
Pix2Struct base	72.1	38.2	-	-	-	-
mPLUG-DocOwl	62.2	38.2	42.6	30.3	26.9	60.2
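
To make the comparison concrete, the short script below finds the best reported score for each benchmark in the table. It is only a reading aid: the numbers are copied from the rows above, missing entries ("-") are omitted, and it assumes that a higher score is better in every column, which the review does not state explicitly.

```python
# Scores copied from the table above; entries shown as "-" are left out.
SCORES = {
    "Dessurt": {"DocVQA": 63.2},
    "Donut": {"DocVQA": 67.5, "InfoVQA": 11.6, "DeepForm": 61.6,
              "KLC": 30.0, "WTQ": 18.8, "TabFact": 54.6},
    "Pix2Struct base": {"DocVQA": 72.1, "InfoVQA": 38.2},
    "mPLUG-DocOwl": {"DocVQA": 62.2, "InfoVQA": 38.2, "DeepForm": 42.6,
                     "KLC": 30.3, "WTQ": 26.9, "TabFact": 60.2},
}


def leaders(scores):
    """Map each benchmark to (best score, sorted list of models reaching it).

    Assumes higher is better for every benchmark. Ties are all reported.
    """
    benchmarks = []
    for row in scores.values():
        for name in row:
            if name not in benchmarks:
                benchmarks.append(name)

    result = {}
    for name in benchmarks:
        reported = {model: row[name] for model, row in scores.items() if name in row}
        best = max(reported.values())
        result[name] = (best, sorted(m for m, v in reported.items() if v == best))
    return result


if __name__ == "__main__":
    for benchmark, (best, models) in leaders(SCORES).items():
        print(f"{benchmark}: {best} ({', '.join(models)})")
```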

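Without task-specific fine-tuning, mPLUG-DocOwl achieves state-of-the-art or competitive OCR-free performance on several document understanding benchmarks: in the table above it leads on KLC, WTQ, and TabFact, ties Pix2Struct base on InfoVQA, and trails on DocVQA and DeepForm.
Including language-only and general vision-and-language instruction tuning data helps the model generalize to downstream tasks.
On the LLMDoc evaluation, mPLUG-DocOwl shows much stronger visual-text understanding across document domains than existing MLLMs.
Human evaluation shows that document-related commonsense reasoning, calculation, and creative generation remain challenging, leaving room for improvement.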

This review was generated by AI and checked by a human editor.