QUICK REVIEW

[Paper Review] StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

Yuechen Yu, Yulin Li|arXiv (Cornell University)|Mar 1, 2023

Handwritten Text Recognition Techniques | Cited by 18

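ONE-SENTENCE SUMMARY

StrucTexTv2 pre-trains an image-only encoder, using text region-level masking to jointly reconstruct the masked image regions and their tokens, and achieves strong results on five document understanding tasks without OCR pre-processing.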

ABSTRACT

In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The objectives of our pre-training tasks are reconstructing the pixels of masked image regions and the corresponding masked tokens simultaneously. Hence the pre-trained encoder can capture more textual semantics in comparison to the masked image modeling that usually predicts the masked image patches. Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.

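RESEARCH MOTIVATION AND GOALS

Advance end-to-end document image understanding with image-only input, avoiding the OCR bottleneck.
Propose a text region-level masking scheme for pre-training.
Jointly learn pixel reconstruction and token prediction to capture both visual and textual semantics.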

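PROPOSED METHOD

Two-branch encoder: a CNN visual extractor plus a Transformer semantic module, with FPN-based multi-scale feature fusion.
Two self-supervised pre-training tasks applied to text regions: Masked Language Modeling (MLM) and Masked Image Modeling (MIM).
MLM: masks text regions and predicts the masked tokens from ROI-Align features through a lightweight two-layer MLP.
MIM: an FCN regresses the original pixel values of the masked text regions from fused style (Emb_style) and content (Emb_content) embeddings.
Pre-training is performed on IIT-CDIP Test Collection 1.0; downstream tasks use image-only input and ROI-based region processing.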
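
The text region-level masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_text_regions`, the nested-list grayscale image, the zero fill value, and the `seed` parameter are all assumptions made for illustration, and the default ratio of 0.3 is taken from the ablation finding reported in this review. The sketch only shows how one masking decision per word box can yield both a MIM target (the original pixels) and an MLM target (the word's token).

```python
import random


def mask_text_regions(image, word_boxes, tokens, mask_ratio=0.3, seed=0):
    """Mask a random subset of word boxes in a grayscale image.

    image:      list of rows, each a list of pixel values (hypothetical format)
    word_boxes: (x0, y0, x1, y1) tuples, with x1 and y1 exclusive
    tokens:     one token per word box
    Returns the masked image, the masked box indices, the MIM pixel targets,
    and the MLM token targets.
    """
    if len(word_boxes) != len(tokens):
        raise ValueError("word_boxes and tokens must have the same length")
    masked_image = [row[:] for row in image]  # copy so the input stays intact
    if not word_boxes:
        return masked_image, [], [], []

    rng = random.Random(seed)
    # Mask at least one box, and never more boxes than exist.
    num_masked = min(len(word_boxes), max(1, round(mask_ratio * len(word_boxes))))
    masked_ids = sorted(rng.sample(range(len(word_boxes)), num_masked))

    mim_targets, mlm_targets = [], []
    for i in masked_ids:
        x0, y0, x1, y1 = word_boxes[i]
        # MIM target: the original pixels under the box.
        mim_targets.append([row[x0:x1] for row in image[y0:y1]])
        # MLM target: the token whose pixels are hidden.
        mlm_targets.append(tokens[i])
        # Assumed fill value of 0 for the masked region.
        for y in range(y0, y1):
            for x in range(x0, x1):
                masked_image[y][x] = 0
    return masked_image, masked_ids, mim_targets, mlm_targets


# Toy example: a 4x8 white image with four non-overlapping word boxes.
image = [[255] * 8 for _ in range(4)]
word_boxes = [(0, 0, 2, 2), (2, 0, 4, 2), (4, 0, 6, 2), (0, 2, 4, 4)]
tokens = ["Invoice", "No", "123", "Total"]
masked_image, masked_ids, mim_targets, mlm_targets = mask_text_regions(
    image, word_boxes, tokens
)
```

In a real pipeline the masked image would go to the encoder, and the two heads would be trained against `mlm_targets` and `mim_targets` for the same regions; those components are outside the scope of this sketch.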

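EXPERIMENTAL RESULTS

Research Questions

RQ1: Can image-only pre-training with text-region masking match or even surpass OCR-based multi-modal methods?
RQ2: How do MLM and MIM each contribute to learning visual-textual representations of document images?
RQ3: How do the masking ratio and the choice of encoder backbone affect downstream document understanding tasks?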

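Main Findings

StrucTexTv2-Small reaches 93.40% accuracy on RVL-CDIP (image-only input).
StrucTexTv2-Large reaches 94.62% accuracy on RVL-CDIP (image-only input).
On PubLayNet, StrucTexTv2-Small and StrucTexTv2-Large reach 95.4% and 95.5% mAP, respectively.
On WTW, StrucTexTv2-Small reaches a 78.9% F1 score on table cell structure recognition.
On FUNSD, StrucTexTv2-Small achieves 84.1% 1-NED (document OCR) and 55.0% 1-NED (end-to-end information extraction).
Ablation studies show that combining MLM and MIM outperforms either task alone on RVL-CDIP and PubLayNet; the best masking ratio is about 0.30.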
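
The FUNSD results above are reported as 1-NED, that is, one minus the normalized edit distance. As a reading aid, here is a minimal sketch of the commonly used form of this metric, in which each prediction's Levenshtein distance to its reference is divided by the longer of the two lengths and then averaged. The exact pairing and normalization protocol used in the paper is not given in this review, so treat this definition and the helper names `edit_distance` and `one_minus_ned` as assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_minus_ned(predictions, references):
    """1 - mean(edit_distance / max(len(pred), len(ref))) over paired strings."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be paired")
    if not references:
        raise ValueError("at least one pair is required")
    total = 0.0
    for pred, ref in zip(predictions, references):
        longest = max(len(pred), len(ref))
        if longest:  # two empty strings count as a perfect match
            total += edit_distance(pred, ref) / longest
    return 1.0 - total / len(references)


# Toy example: one exact match and one string with a single wrong character.
score = one_minus_ned(["Total", "1234"], ["Total", "1239"])
```

Under this definition a score of 1.0 means every predicted string matches its reference exactly, and each character-level error lowers the score in proportion to the string length.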

This review was generated by AI and checked by a human editor.