QUICK REVIEW

[Paper Review] Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull|arXiv (Cornell University)|Aug 25, 2023

Handwritten Text Recognition Techniques | Cited by 18

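One-line summary

Nougat is an encoder-decoder visual transformer that performs end-to-end OCR on document pages and outputs lightweight markup text. It is trained on a large arXiv/PMC dataset and is released together with its code and models.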

ABSTRACT

Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.

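Motivation and objectives

Motivate the need to recover semantic structure, especially mathematical expressions, from PDFs and scanned books.
Propose an OCR-free visual document understanding approach that converts page images into markup.
Create a paired dataset for generating markup from scientific documents, and release the pretrained models and code.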

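Proposed method

Encoder: a Swin Transformer processes the page image into latent patch embeddings.
Decoder: a Transformer-based autoregressive generator (inspired by Donut and mBART) converts the embeddings into tokens from the markup vocabulary.
Training: end-to-end optimization with AdamW for 3 epochs, with a long maximum sequence length (S = 4096); greedy decoding is used at inference.
Data augmentation: image perturbations imitate scanned documents, and perturbing the ground-truth tokens suppresses collapse into repetition.
Dataset construction: paired data generated automatically from arXiv, PMC, and the Industry Documents Library via LaTeXML preprocessing and page-level splitting and alignment.
Repetition handling: an anti-repetition augmentation and a heuristic repetition detector, both aimed at suppressing repetition at inference time.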
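The two repetition-related ideas in the list above can be illustrated with a minimal sketch. The review does not describe either mechanism in detail, so both functions below are hypothetical stand-ins rather than the authors' implementation: `perturb_tokens` replaces ground-truth tokens with random vocabulary ids at an assumed probability `p`, and `has_trailing_repetition` is a simple pattern check on the generated sequence, with `max_period` and `min_repeats` chosen purely for illustration.

```python
import random


def perturb_tokens(tokens, vocab_size, p=0.1, rng=None):
    """Return a copy of `tokens` in which each token is independently
    replaced by a random vocabulary id with probability `p`.

    Assumption: uniform random replacement; the source only says that
    ground-truth tokens are perturbed during training.
    """
    rng = rng or random.Random()
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def has_trailing_repetition(tokens, max_period=20, min_repeats=3):
    """Return True if the sequence ends with one block of tokens
    repeated at least `min_repeats` times in a row.

    Assumption: a plain pattern check, used here as a stand-in for the
    heuristic detector mentioned in the review.
    """
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break
        tail = tokens[n - span:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False
```

In a training loop, `perturb_tokens` would be applied to the decoder input so the model learns to recover from a wrong previous token; at inference, generation could be stopped for a page as soon as `has_trailing_repetition` fires.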

Figure 1: Our simple end-to-end architecture following Donut [28]. The Swin Transformer encoder takes a document image and converts it into latent embeddings, which are subsequently converted to a sequence of tokens in an autoregressive manner.
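The autoregressive step in the caption, combined with the greedy decoding noted in the method list, can be sketched as follows. This is a toy illustration under stated assumptions: `next_token_scores` is a hypothetical callable standing in for the decoder (in a real model it would also condition on the encoder's image embeddings), and `toy_scores` is an invented scoring function used only to make the example runnable.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_len: int = 4096,
) -> List[int]:
    """Generate tokens one at a time, always taking the highest-scoring one."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)
        # Greedy choice: index of the maximum score, no sampling or beam search.
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == eos_id:
            break
    return tokens


def toy_scores(prefix: List[int]) -> List[float]:
    """Hypothetical decoder over a 5-token vocabulary: it prefers the id
    equal to the current prefix length, capped at 4 (used as end-of-sequence)."""
    target = min(len(prefix), 4)
    return [1.0 if i == target else 0.0 for i in range(5)]


if __name__ == "__main__":
    print(greedy_decode(toy_scores, bos_id=0, eos_id=4))
```

The default `max_len` mirrors the sequence length S = 4096 quoted in the method list; every other name here is illustrative.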

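Experimental results

Research questions

RQ1: Can an OCR-free vision transformer accurately convert document page images into structured markup that includes text, math, and tables?
RQ2: How well does the model perform on plain text, math, and tables across scientific documents?
RQ3: How do model size (250M vs. 350M parameters) and decoding strategy affect accuracy and speed?
RQ4: Which data and augmentation strategies enable end-to-end training without external OCR tools?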

Key findings

Method	Modality	Edit distance ↓	BLEU ↑	METEOR ↑	Precision ↑	Recall ↑	F1 ↑
PDF	All	0.255	65.8	82.1	77.1	81.4	79.2
GROBID	All	0.312	55.6	71.9	74.0	72.1	73.0
	Tables	0.626	25.1	64.5	61.4	80.7	69.7
+ LaTeX OCR	Plain text	0.363	57.4	69.2	82.1	70.5	75.9
	Math	0.727	0.3	5.0	11.0	8.6	9.7
Nougat small (250M*)	All	0.073	88.9	92.8	93.6	92.2	92.9
	Tables	0.220	68.5	78.6	75.0	79.8	77.3
	Plain text	0.058	91.0	94.3	96.1	95.3	95.7
	Math	0.117	56.0	74.7	77.1	76.8	76.9
Nougat base (350M*)	All	0.071	89.1	93.0	93.5	92.8	93.1
	Tables	0.211	69.7	79.1	75.4	80.7	78.0
	Plain text	0.058	91.2	94.6	96.2	95.3	95.7
	Math	0.128	56.9	75.4	76.5	76.6	76.5

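On the arXiv test set, Nougat small (250M) performs strongly across all modalities: overall edit distance 0.073 and F1 92.9, with F1 of 95.7 on plain text and 76.9 on math.
Nougat base (350M) is only marginally better overall (edit distance 0.071, BLEU 89.1, F1 93.1) and on tables. Plain-text results are nearly identical, and math is mixed: base has higher BLEU and METEOR, while small has the lower edit distance (0.117 vs. 0.128) and the higher F1 (76.9 vs. 76.5).
Plain text reaches 91.0–91.2 BLEU and 95.7 F1 (precision 96.1–96.2, recall 95.3) for both models, indicating robust text recovery.
Math scores lower than plain text (F1 76.5–76.9), which the review attributes to ambiguity in how the same expression can be written in LaTeX, but it is far above the GROBID + LaTeX OCR baseline (F1 9.7).
Tables improve moderately over GROBID (F1 77.3–78.0 vs. 69.7; BLEU 68.5–69.7 vs. 25.1) but stay below plain text, which reflects how hard it is to extract structured content such as equations and tables.
In-domain (arXiv), the proposed models outperform GROBID and the OCR-based baselines on most metrics. The exception in the table is recall on tables, where GROBID's 80.7 matches Nougat base and exceeds Nougat small (79.8).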
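To make the table's columns concrete, here is a minimal sketch of two of the metric families it reports: a character-level edit distance and token-level precision, recall, and F1. The review does not state the exact normalization or tokenization, so two choices below are assumptions: the edit distance is divided by the length of the longer string, and tokens are obtained by whitespace splitting.

```python
from collections import Counter


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length
    (0.0 means identical; the normalization is an assumption)."""
    if not pred and not ref:
        return 0.0
    prev = list(range(len(ref) + 1))
    for i, pc in enumerate(pred, start=1):
        cur = [i]
        for j, rc in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,               # delete a character from pred
                cur[j - 1] + 1,            # insert a character into pred
                prev[j - 1] + (pc != rc),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1] / max(len(pred), len(ref))


def token_prf(pred: str, ref: str):
    """Precision, recall, and F1 over whitespace tokens, counted as multisets."""
    p_counts, r_counts = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p_counts & r_counts).values())
    precision = overlap / max(sum(p_counts.values()), 1)
    recall = overlap / max(sum(r_counts.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

The F1 column in the table is consistent with this harmonic-mean definition. For the PDF row, 2 × 77.1 × 81.4 / (77.1 + 81.4) ≈ 79.2.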

Figure 2: List of the different image augmentation methods used during training on an example snippet from a sample document.

This review was generated by AI and checked by a human editor.