QUICK REVIEW

[Paper Review] StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

Yuechen Yu, Yulin Li|arXiv (Cornell University)|2023. 03. 01.

Handwritten Text Recognition Techniques | Citations: 18

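One-Line Summary

StrucTexTv2 pre-trains an image-only encoder with text region-level masking, jointly reconstructing the masked image regions and their tokens, and achieves strong performance on five document understanding tasks without OCR pre-processing.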

ABSTRACT

In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The objectives of our pre-training tasks are reconstructing the pixels of masked image regions and the corresponding masked tokens simultaneously. Hence the pre-trained encoder can capture more textual semantics in comparison to the masked image modeling that usually predicts the masked image patches. Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.

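Motivation and Objectives

Enable end-to-end document image understanding from image-only input, avoiding the OCR bottleneck.
Propose a text region-level masking scheme for pre-training.
Jointly learn pixel reconstruction and token prediction to capture both visual and textual semantics.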

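Proposed Method

Dual-branch encoder: a CNN visual extractor plus a Transformer semantic module, with an FPN for multi-scale feature fusion.
Two self-supervised pre-training tasks on text regions: masked language modeling (MLM) and masked image modeling (MIM).
MLM: masks text regions and predicts the masked word tokens with a lightweight 2-layer MLP applied to ROI-Align features.
MIM: regresses the raw pixel values of the masked text regions with an FCN that combines Emb_style (style embedding) and Emb_content (content embedding).
Pre-trained on IIT-CDIP Test Collection 1.0; downstream tasks use image-only input and ROI-based region processing.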
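
The text region-level masking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_text_regions`, the list-of-rows grayscale image, the constant fill value, and the `(x0, y0, x1, y1)` box convention with exclusive upper bounds are all assumptions chosen for illustration. The default ratio of 0.3 follows the approximate optimum reported in this review.

```python
import random


def mask_text_regions(image, word_boxes, mask_ratio=0.3, fill=0, seed=None):
    """Blank out a random subset of word boxes in a grayscale image.

    image:      list of rows, each a list of pixel values (H x W).
    word_boxes: list of (x0, y0, x1, y1) boxes; x1 and y1 are exclusive.
    Returns (masked_copy, sorted indices of the masked boxes).
    All names and conventions here are hypothetical.
    """
    rng = random.Random(seed)
    # Mask at least one box whenever any boxes are present.
    n_mask = max(1, round(len(word_boxes) * mask_ratio)) if word_boxes else 0
    masked_ids = sorted(rng.sample(range(len(word_boxes)), n_mask))

    out = [row[:] for row in image]  # copy so the input stays untouched
    height = len(out)
    width = len(out[0]) if out else 0
    for i in masked_ids:
        x0, y0, x1, y1 = word_boxes[i]
        # Clip each box to the image bounds before filling.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                out[y][x] = fill
    return out, masked_ids


# Example: ten 2-pixel-wide word boxes on a 4 x 20 white image.
page = [[255] * 20 for _ in range(4)]
boxes = [(2 * i, 0, 2 * i + 2, 4) for i in range(10)]
masked_page, masked_ids = mask_text_regions(page, boxes, mask_ratio=0.3, seed=0)
```

In a pre-training setup of this kind, the masked boxes would serve as the regions whose pixels (MIM) and word tokens (MLM) the model is asked to recover; the sketch covers only the masking step.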

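Experimental Results

Research Questions

RQ1: Can image-only pre-training with text region masking match or exceed OCR-based multi-modal approaches?
RQ2: How do MLM and MIM contribute to learning visual-textual representations of document images?
RQ3: How do the masking ratio and the choice of encoder backbone affect downstream document understanding tasks?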

Key Results

StrucTexTv2-Small reaches 93.40% accuracy on RVL-CDIP (image-only input).
StrucTexTv2-Large reaches 94.62% accuracy on RVL-CDIP (image-only input).
On PubLayNet, StrucTexTv2-Small and StrucTexTv2-Large reach 95.4% and 95.5% mAP, respectively.
On WTW, StrucTexTv2-Small reaches a 78.9% F1 score for table cell structure recognition.
On FUNSD, StrucTexTv2-Small reaches 84.1% 1-NED for document OCR and 55.0% 1-NED for end-to-end information extraction.
Ablations show that combining MLM and MIM outperforms either task alone on RVL-CDIP and PubLayNet, and that the best masking ratio is approximately 0.30.
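
The FUNSD figures above are reported as 1-NED, i.e. one minus a normalized edit distance. The sketch below shows one common way to compute it for a single prediction. Normalizing by the length of the longer string is an assumption, since this review does not state the exact evaluation protocol, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_minus_ned(pred, target):
    """1-NED for one sample; normalization by the longer string is assumed."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0  # two empty strings count as a perfect match
    return 1.0 - edit_distance(pred, target) / longest
```

A dataset-level score would typically average this value over all samples; higher is better, with 1.0 meaning every prediction matches its target exactly.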

This review was generated by AI and checked by a human editor.