QUICK REVIEW

[Paper Review] PubLayNet: largest dataset ever for document layout analysis

Zhong Xu, Jianbin Tang|arXiv (Cornell University)|Aug 16, 2019
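Handwritten Text Recognition Techniques | References: 21 | Citations: 39
One-sentence summary

PubLayNet automatically annotates PubMed Central PDFs to build a large-scale document layout dataset. State-of-the-art object detectors trained on PubLayNet achieve high layout MAP and enable effective transfer learning to other domains.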

ABSTRACT

Recognizing the layout of unstructured digital documents is an important step when parsing the documents into a structured machine-readable format for downstream applications. Deep neural networks that are developed for computer vision have been proven to be an effective method to analyze the layout of document images. However, document layout datasets that are currently publicly available are several orders of magnitude smaller than established computer vision datasets. Models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated. The experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base model for transfer learning on a different document domain. We release the dataset (https://github.com/ibm-aur-nlp/PubLayNet) to support development and evaluation of more advanced models for document layout analysis.

Motivation and objectives

  • Automatically generate a large, high-quality annotated dataset of document layout from PubMed Central PDFs and their XML representations.
  • Evaluate deep object detection methods on PubLayNet for layout understanding of scientific articles.
  • Assess transfer learning benefits from PubLayNet to other document domains and compare with ImageNet/COCO pretraining.

Proposed method

  • Automatically align PDF layout elements with XML annotations from PubMed Central Open Access to create per-element layout labels.
  • Segment text, titles, lists, tables, and figures in PDFs using PDFMiner-based extraction and XML-guided labeling.
  • Train Faster-RCNN and Mask-RCNN models with a ResNeXt-101 backbone on PubLayNet using Detectron; evaluate with mean average precision (MAP) at IOU thresholds [0.50:0.95].
  • Partition data into train/dev/test at journal level to maximize template diversity and assess generalization.
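The journal-level partition in the last step can be illustrated with a minimal sketch. It assumes each page is represented as a `(page_id, journal)` pair; the function name `split_by_journal`, the field names, and the 10% dev/test fractions are hypothetical choices for this example and are not taken from the paper. The point is only that whole journals, not individual pages, are assigned to a split, so no journal template appears in more than one set.

```python
import random
from collections import defaultdict


def split_by_journal(pages, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign every page of a journal to exactly one of train/dev/test.

    `pages` is a list of (page_id, journal) tuples. The fractions are
    illustrative assumptions, not values reported for PubLayNet.
    """
    by_journal = defaultdict(list)
    for page_id, journal in pages:
        by_journal[journal].append(page_id)

    # Sort before shuffling so the result depends only on the seed.
    journals = sorted(by_journal)
    random.Random(seed).shuffle(journals)

    n_dev = max(1, round(len(journals) * dev_frac))
    n_test = max(1, round(len(journals) * test_frac))
    journal_groups = {
        "dev": journals[:n_dev],
        "test": journals[n_dev:n_dev + n_test],
        "train": journals[n_dev + n_test:],
    }
    return {
        name: [(page_id, j) for j in group for page_id in by_journal[j]]
        for name, group in journal_groups.items()
    }


# Toy data: 10 hypothetical journals with 3 pages each.
pages = [(f"j{j}_p{p}", f"journal_{j}") for j in range(10) for p in range(3)]
splits = split_by_journal(pages)
for name, items in splits.items():
    print(name, len(items), "pages from", len({j for _, j in items}), "journals")
```

Splitting at the page level instead would let pages from the same journal, and therefore the same layout template, land in both training and evaluation sets, which would overstate how well a model generalizes to unseen templates.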

Experimental results

Research questions

  • RQ1: How well can standard object detectors (Faster-RCNN, Mask-RCNN) learn document layout categories from PubLayNet?
  • RQ2: Does pre-training on PubLayNet provide a better initialization for transferring to other domains (e.g., SPD health insurance documents) than ImageNet/COCO pretraining?
  • RQ3: Can PubLayNet enable competitive performance on table detection tasks and related layout challenges with limited fine-tuning data?

Key findings

Category    Model    Dev MAP   Test MAP
Text        F-RCNN   0.910     0.913
Text        M-RCNN   0.916     0.917
Title       F-RCNN   0.826     0.812
Title       M-RCNN   0.840     0.828
List        F-RCNN   0.883     0.885
List        M-RCNN   0.886     0.887
Table       F-RCNN   0.954     0.943
Table       M-RCNN   0.960     0.947
Figure      F-RCNN   0.937     0.945
Figure      M-RCNN   0.949     0.955
Macro Avg   F-RCNN   0.902     0.900
Macro Avg   M-RCNN   0.910     0.907
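The Macro Avg rows are the unweighted mean of the five per-category MAP values for each model. The short sketch below recomputes them from the per-category numbers in the table as a consistency check; the dictionary layout and the helper name `macro_average` are illustrative choices, not part of the paper or its released code.

```python
from statistics import mean

# Per-category MAP copied from the table: category -> (dev, test).
scores = {
    "F-RCNN": {
        "Text": (0.910, 0.913),
        "Title": (0.826, 0.812),
        "List": (0.883, 0.885),
        "Table": (0.954, 0.943),
        "Figure": (0.937, 0.945),
    },
    "M-RCNN": {
        "Text": (0.916, 0.917),
        "Title": (0.840, 0.828),
        "List": (0.886, 0.887),
        "Table": (0.960, 0.947),
        "Figure": (0.949, 0.955),
    },
}


def macro_average(per_category):
    """Unweighted mean over categories, returned as (dev, test)."""
    dev = mean(d for d, _ in per_category.values())
    test = mean(t for _, t in per_category.values())
    return dev, test


for model, per_category in scores.items():
    dev, test = macro_average(per_category)
    print(f"{model}: dev={dev:.3f} test={test:.3f}")
```

Because the mean is unweighted, the comparatively low Title scores pull the average down as much as any other category, regardless of how many title instances the dataset contains.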
  • Faster-RCNN and Mask-RCNN achieve high layout detection performance on PubLayNet, with macro-average MAP@IOU [0.50:0.95] between 0.900 and 0.910 on the development and test sets.
  • Mask-RCNN generally performs slightly better than Faster-RCNN across categories (Text, Title, List, Table, Figure).
  • Table and Figure detections are more accurate than Text, Title, and List detections, likely due to regular shapes and distinctiveness.
  • Fine-tuning models pretrained on PubLayNet achieves state-of-the-art results on ICDAR 2013 Table Recognition with very little fine-tuning data (as few as 170 pages).
  • Without fine-tuning (zero-shot), PubLayNet-trained models perform worse on SPD document layouts than fine-tuned models initialized from PubLayNet, COCO, or ImageNet, so the transfer benefit of PubLayNet depends on fine-tuning with target-domain data.
  • PubLayNet enables effective transfer learning to non-biomedical domains, though gains are domain-dependent (transfer is harder for tables).
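The MAP@IOU [0.50:0.95] metric quoted in these findings follows the COCO convention of averaging average precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the matching criterion behind it: how the IoU of a predicted box against a ground-truth box is computed, and at which thresholds that prediction would count as a hit. It is not a full AP implementation, and the box coordinates are made-up values for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def matched_thresholds(pred, gt):
    """Thresholds at which `pred` is close enough to `gt` to count as a match."""
    overlap = iou(pred, gt)
    return [t for t in THRESHOLDS if overlap >= t]


# Hypothetical ground-truth table region and two predictions.
gt = (100, 100, 500, 300)
pred_tight = (110, 100, 500, 300)   # nearly perfect localization
pred_loose = (100, 100, 410, 300)   # cuts off the right part of the table

print(len(matched_thresholds(pred_tight, gt)), "of", len(THRESHOLDS))
print(len(matched_thresholds(pred_loose, gt)), "of", len(THRESHOLDS))
```

Averaging over the stricter thresholds rewards tight localization, which is one reason a score above 0.90 on this metric is a demanding result compared with a single IoU 0.50 cutoff.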

This review was generated by AI and checked by a human editor.