QUICK REVIEW

[Paper Review] You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine

Thibault Clérice|arXiv (Cornell University)|Jul 19, 2022

Infrared Target Detection Methodologies|Cited by 2

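One-Sentence Summary

This paper presents YALTAi, an approach that replaces Kraken's pixel-based layout segmentation with YOLOv5 object detection, markedly improving accuracy and speed on small historical-document datasets. It reports up to a 100-fold improvement over Kraken on column detection and doubled scores on main-body zone detection, and it releases a new open-source toolkit along with two benchmark datasets of historical documents.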

ABSTRACT

Layout Analysis (the identification of zones and their classification) is the first step along line segmentation in Optical Character Recognition and similar tasks. The ability of identifying main body of text from marginal text or running titles makes the difference between extracting the work full text of a digitized book and noisy outputs. We show that most segmenters focus on pixel classification and that polygonization of this output has not been used as a target for the latest competition on historical document (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, the task from a pixel classification-based polygonization to an object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the later severely outperforms the first on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents as well as a new package, YALTAi, which injects YOLOv5 in the segmentation pipeline of Kraken 4.1.

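Research Motivation and Objectives

Address Kraken's poor performance on small datasets (≤1110 samples), particularly when distinguishing adjacent text zones such as columns and marginal notes.
Overcome the limitations of pixel classification followed by polygonization in layout analysis, which hinder accurate extraction of the main body text.
Shift from polygon and pixel annotation to object detection with bounding boxes in order to improve efficiency and accuracy.
Release two new datasets, YALTAi-Tables and YALTAi-MSS-EPB, for training and evaluating layout segmentation on historical documents.
Develop YALTAi, a pluggable toolkit that integrates YOLOv5 into Kraken's pipeline and supports YOLO-based zone detection through a Kraken-like command-line interface (CLI).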

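Proposed Method

Recast layout segmentation as an object detection task using YOLOv5, predicting isothetic (axis-aligned) bounding boxes instead of pixel-level segmentation.
Convert ALTO XML annotations into YOLOv5-compatible label format (class ID, normalized center point, width and height) for training.
Train YOLOv5n and YOLOv5x models on the two new datasets: YALTAi-Tables (tabular documents from the 16th to the early 20th century) and YALTAi-MSS-EPB (manuscripts and early printed books from the 9th to the 16th century).
Integrate YOLOv5 detection into Kraken's pipeline through the YALTAi toolkit, replacing Kraken's segmenter with YOLOv5 while keeping Kraken's line serialization and OCR workflow.
Annotate document zones (such as Main, DropCapital, MarginText) consistently with the Segmonto ontology, so that the two datasets share the same labeling scheme.
Provide model inference and conversion between ALTO and YOLOv5 formats through a command-line interface that stays consistent with Kraken's.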
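
The label conversion described above can be illustrated with a minimal sketch. It assumes a zone is given by the ALTO pixel attributes HPOS, VPOS, WIDTH and HEIGHT, and that the page size in pixels is known. The function name `alto_box_to_yolo` and the `CLASS_IDS` mapping are hypothetical, chosen for illustration only; they are not taken from the YALTAi package.

```python
from typing import Dict

# Hypothetical class mapping for illustration; real IDs depend on the training configuration.
CLASS_IDS: Dict[str, int] = {"Main": 0, "MarginText": 1, "DropCapital": 2}


def alto_box_to_yolo(
    zone_type: str,
    hpos: float,
    vpos: float,
    width: float,
    height: float,
    page_width: float,
    page_height: float,
) -> str:
    """Turn an ALTO-style pixel box (top-left corner plus size) into a YOLO label line.

    YOLO labels store the box center and size, each normalized by the page dimensions.
    """
    x_center = (hpos + width / 2) / page_width
    y_center = (vpos + height / 2) / page_height
    norm_w = width / page_width
    norm_h = height / page_height
    return f"{CLASS_IDS[zone_type]} {x_center:.6f} {y_center:.6f} {norm_w:.6f} {norm_h:.6f}"


# A Main zone at (100, 200), 400 x 600 pixels, on a 1000 x 2000 pixel page.
label = alto_box_to_yolo("Main", 100, 200, 400, 600, page_width=1000, page_height=2000)
print(label)  # 0 0.300000 0.250000 0.400000 0.300000
```

Because the boxes are axis-aligned, the conversion is pure arithmetic and loses nothing for rectangular zones. A polygonal zone would first have to be reduced to its enclosing rectangle.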

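Experimental Results

Research Questions

RQ1: Can YOLOv5 object detection outperform Kraken's pixel-based segmentation on small historical-document datasets?
RQ2: How does moving from polygonization and pixel classification to bounding-box detection affect the accuracy and inference speed of document layout segmentation?
RQ3: How well does YOLOv5 generalize to unseen historical document layouts, especially complex multi-column or tabular formats?
RQ4: In a small-sample setting, how do model size and architecture (YOLOv5n versus YOLOv5x) affect performance and efficiency?
RQ5: Does integrating YOLOv5 into Kraken's pipeline improve segmentation quality while remaining compatible with existing HTR and OCR workflows?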

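Key Findings

On the Segmonto dataset, YOLOv5x reaches a mean average precision (mAP) of 47.75% on the Main zone against 6.98% for Kraken, an improvement of more than 6 times.
On the YALTAi-Tables dataset, YOLOv5x reaches an mAP of 4.77% on the Col zone and 12.9% on the Header zone, while Kraken scores only 0.09% and 0.1% respectively.
YOLOv5n outperforms Kraken on every zone except RunningTitle, reaching an mAP of 34.63% on the Main zone against 6.98% for Kraken.
YOLOv5 models show markedly faster inference, with median per-image prediction times of 0.004 s (YOLOv5n) and 0.025 s (YOLOv5x), whereas Kraken lacks batch processing and takes longer to train.
YOLOv5 generalizes better to unseen tabular documents, correctly detecting and separating multiple columns where Kraken often merges them into a single zone.
Compared with Kraken, the YALTAi toolkit reduces GPU memory usage by up to 50% and peak power consumption by 30% while maintaining high accuracy.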
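
The relative gains behind the mAP figures above can be checked with a few lines of arithmetic. This is only a sketch over the numbers quoted in this review; the dictionary keys and the helper `improvement_factor` are illustrative names, not part of any toolkit.

```python
# (YOLOv5 mAP %, Kraken mAP %) pairs as quoted in the findings above.
reported = {
    "Main (YOLOv5x)": (47.75, 6.98),
    "Main (YOLOv5n)": (34.63, 6.98),
    "Col (YOLOv5x)": (4.77, 0.09),
    "Header (YOLOv5x)": (12.9, 0.1),
}


def improvement_factor(yolo_map: float, kraken_map: float) -> float:
    """Ratio of the YOLOv5 score to the Kraken score for the same zone."""
    return yolo_map / kraken_map


factors = {zone: improvement_factor(*scores) for zone, scores in reported.items()}

for zone, factor in factors.items():
    print(f"{zone}: about {factor:.1f}x Kraken's mAP")
```

The ratios grow very large where Kraken's score is close to zero, so for zones such as Col and Header the absolute mAP values are the more informative comparison.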

This review was generated by AI and reviewed by a human editor.