QUICK REVIEW

[Paper Review] FP-THD: Full page transcription of historical documents

H Neji, J Nogueras-Iso|arXiv (Cornell University)|Jan 20, 2026

Handwritten Text Recognition Techniques | Cited by 0

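One-sentence summary

FP-THD proposes a full-page transcription pipeline that combines layout analysis with an extended MAE-ViT OCR model, so that historical characters and symbols are preserved faithfully in both handwritten and printed text.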

ABSTRACT

The transcription of historical documents written in Latin in XV and XVI centuries has special challenges as it must maintain the characters and special symbols that have distinct meanings to ensure that historical texts retain their original style and significance. This work proposes a pipeline for the transcription of historical documents preserving these special features. We propose to extend an existing text line recognition method with a layout analysis model. We analyze historical text images using a layout analysis model to extract text lines, which are then processed by an OCR model to generate a fully digitized page. We showed that our pipeline facilitates the processing of the page and produces an efficient result. We evaluated our approach on multiple datasets and demonstrate that the masked autoencoder effectively processes different types of text, including handwritten, printed and multi-language.

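Motivation and goals

Develop a pipeline that transcribes full pages of historical documents while preserving old characters and symbols.
Integrate a layout analysis module that extracts text lines before OCR.
Extend the masked-autoencoder-based OCR (MAE-ViT) to handle printed, handwritten, and multilingual text.
Provide outputs including PAGE XML and human-readable Markdown/TXT representations.
Evaluate on diverse datasets and demonstrate that historical typographic features are preserved.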

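Proposed method

ParseNet performs layout analysis, detecting baselines, regions, and lines, and outputs PAGE-XML.
Detected text lines are cropped and rectified into uniform 50-pixel-high images for OCR.
An extended MAE-ViT OCR model with a CNN feature extractor (ResNet-18) and span masking recognizes handwritten and printed text robustly, with no post-processing.
MAE-ViT is trained on dataset-specific line images with a mask ratio of 0.4, a maximum span length of 8, and 100k training iterations.
Several outputs are generated, including PAGE-XML, Markdown, and plain TXT, to support downstream analysis and OCR performance measurement.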
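
The span-masking settings listed above (mask ratio 0.4, maximum span length 8) can be illustrated with a minimal sketch. The sampling strategy here is an assumption, since this review gives only the two hyperparameters: spans of random length are placed at random positions along the token sequence of a line image until the target fraction is masked. The function name `span_mask` and its parameters are hypothetical and not taken from the paper's code.

```python
import random


def span_mask(num_tokens, mask_ratio=0.4, max_span=8, seed=None):
    """Return a list of booleans; True marks a masked token position.

    Assumed strategy (not confirmed by the source): repeatedly draw a span
    length in [1, max_span] and a random start, and mask that span until
    exactly round(num_tokens * mask_ratio) positions are masked.
    """
    rng = random.Random(seed)
    target = round(num_tokens * mask_ratio)
    mask = [False] * num_tokens
    masked = 0
    while masked < target:
        # Never draw a span longer than the number of positions still needed.
        span = min(rng.randint(1, max_span), target - masked)
        start = rng.randint(0, num_tokens - span)
        for i in range(start, start + span):
            # Spans may overlap; positions already masked are not recounted.
            if not mask[i]:
                mask[i] = True
                masked += 1
    return mask


if __name__ == "__main__":
    m = span_mask(100, seed=0)
    # Print the mask as a string: '#' is masked, '.' is visible.
    print("".join("#" if x else "." for x in m))
    print(sum(m), "of", len(m), "positions masked")
```

Masking contiguous spans rather than isolated positions hides whole character fragments at once, which is the usual motivation for span masking in sequence models; whether FP-THD samples spans exactly this way is not stated in this review.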

Figure 1: FP-THD architecture Overview: Layout Analysis and Masked Auto-encoder with Vision Transformer

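Experimental results

Research questions

RQ1: Can a full-page transcription pipeline preserve historical characters and abbreviations without modernizing them?
RQ2: How much does a layout-analysis-first approach improve transcription accuracy on medieval Latin documents?
RQ3: How effective is MAE-ViT-based OCR on handwritten, printed, and multilingual historical text?
RQ4: Can the pipeline produce both a representation useful for human annotation (Markdown) and machine-readable output?
RQ5: How does FP-THD perform compared with existing transcription methods on historical Latin text datasets?

Key findings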

Model	CER (%)	WER (%)
BVPB [26]	0.3379	0.6835
Pero-OCR [20]	0.0242	0.2106
FP-THD	0.0178	0.0450
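
CER and WER are edit-distance ratios at the character and word level. The sketch below assumes the standard Levenshtein-based definitions (edit distance divided by reference length); the paper's exact normalization and tokenization are not given in this review, and the function names are illustrative. Both functions return a fraction, not a percentage.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate as a fraction, splitting on whitespace (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Modernizing the long s ("ſ") to "s" counts as one character error.
    print(cer("ſunt", "sunt"))
    print(wer("dominus ſunt", "dominus sunt"))
```

Because the reference keeps the historical glyphs, an OCR system that silently modernizes characters is penalized under these metrics, which is consistent with the preservation goal described in this review.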

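The MAE-ViT-based OCR achieves a CER of 1.30% and a WER of 6.97% on Rodrigo, and a CER of 4.46% and a WER of 7.68% on Bentham, with no post-processing.
Printed-text transcription on Molino achieves a CER of 1.43% and a WER of 5.39% on the validation set.
On the Molino dataset, FP-THD outperforms the Pero-OCR and ABBY transcriptions in both CER (0.0178) and WER (0.0450).
The pipeline preserves the diacritics and trill marks that matter for medieval Latin transcription.
Layout analysis with ParseNet provides structured line regions, making full-page reconstruction into XML and text formats more accurate.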
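
As a rough illustration of how a PAGE-XML result can be flattened into the plain TXT output mentioned in this review, the sketch below reads line-level transcriptions with Python's standard library. It assumes the usual PAGE layout (`TextLine` elements holding a `TextEquiv/Unicode` child) and that document order is the reading order; the sample XML, the `page_xml_to_text` helper, and its behavior are illustrative and not taken from the FP-THD code.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the PAGE namespace (illustrative content only).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Baseline points="10,50 900,52"/>
        <TextEquiv><Unicode>In principio</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Baseline points="10,110 900,112"/>
        <TextEquiv><Unicode>erat verbum</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def page_xml_to_text(xml_string):
    """Join the line-level Unicode transcriptions, one text line per row."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct children, so word-level TextEquiv is ignored.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for node in child:
                if local_name(node.tag) == "Unicode":
                    lines.append(node.text or "")
    return "\n".join(lines)


if __name__ == "__main__":
    print(page_xml_to_text(SAMPLE))
```

Matching on local tag names instead of a fixed namespace keeps the sketch usable across PAGE schema versions, at the cost of not validating the document against any of them.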

Figure 2: Example text lines by datasets.

This review was generated by AI and reviewed by a human editor.