QUICK REVIEW

[Paper Review] OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

Uwe Springmann, Anke Lüdeling | arXiv (Cornell University) | Aug 6, 2016

Topic: Handwritten Text Recognition Techniques | Cited by: 33

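Summary in brief

This paper presents a neural-network-based OCR system, trained with the OCRopus engine on diplomatic transcriptions of historical German herbals printed between 1487 and 1870, including incunabula. The trained models reach character accuracies of 94% to more than 99% and word accuracies of 76% to 97%. The approach makes accurate, largely automated digitization of early printed books feasible, and it allows diachronic corpora to be built at scale with little manual effort.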

ABSTRACT

This article describes the results of a case study that applies Neural Network-based Optical Character Recognition (OCR) to scanned images of books printed between 1487 and 1870 by training the OCR engine OCRopus [@breuel2013high] on the RIDGES herbal text corpus [@OdebrechtEtAlSubmitted]. Training specific OCR models was possible because the necessary *ground truth* is available as error-corrected diplomatic transcriptions. The OCR results have been evaluated for accuracy against the ground truth of unseen test sets. Character and word accuracies (percentage of correctly recognized items) for the resulting machine-readable texts of individual documents range from 94% to more than 99% (character level) and from 76% to 97% (word level). This includes the earliest printed books, which were thought to be inaccessible by OCR methods until recently. Furthermore, OCR models trained on one part of the corpus consisting of books with different printing dates and different typesets *(mixed models)* have been tested for their predictive power on the books from the other part containing yet other fonts, mostly yielding character accuracies well above 90%. It therefore seems possible to construct generalized models trained on a range of fonts that can be applied to a wide variety of historical printings still giving good results. A moderate postcorrection effort of some pages will then enable the training of individual models with even better accuracies. Using this method, diachronic corpora including early printings can be constructed much faster and cheaper than by manual transcription. The OCR methods reported here open up the possibility of transforming our printed textual cultural heritage into electronic text by largely automatic means, which is a prerequisite for the mass conversion of scanned books.

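Motivation and objectives

Develop a trainable OCR system that accurately recognizes historical printings, including the earliest incunabula, which were previously considered out of reach for OCR.
Evaluate the performance of neural-network-based OCR on a diverse corpus of historical German herbals spanning four centuries.
Assess how well OCR models trained on a mix of fonts generalize to unseen historical texts with different typefaces and printing dates.
Show that a generalized mixed-model OCR system can serve as a reliable first approximation for corpus construction, reducing the need for full manual transcription.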

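Proposed method

Train custom OCR models on scanned page images with the OCRopus engine, which uses an LSTM recurrent neural network architecture.
Use diplomatic transcriptions, that is, error-corrected and glyph-accurate transcriptions of the original text, as ground truth for supervised training.
Build mixed models from diverse subsets of the RIDGES corpus that cover multiple printing dates and typefaces.
Evaluate OCR performance on unseen test sets using character-level and word-level accuracy.
Post-correct a small number of pages with moderate effort and use them to train individual models with further improved accuracy.
Release the trained models and the resulting OCR corpus (RIDGES-OCR) under a CC-BY license to encourage reuse and community-driven model improvement.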
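
The evaluation step above can be illustrated with a minimal sketch. It assumes that accuracy is defined as one minus the edit distance divided by the length of the ground truth, which is a common convention for OCR evaluation; the paper's own tooling may differ in detail. The function names and the sample line are hypothetical and are not taken from the RIDGES corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete an item from ref
                            curr[j - 1] + 1,      # insert an item from hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed definition: 1 - edit distance / reference length, floored at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ground_truth, ocr_output):
    """Character-level accuracy: compare the two strings character by character."""
    return accuracy(ground_truth, ocr_output)


def word_accuracy(ground_truth, ocr_output):
    """Word-level accuracy: compare whitespace-separated tokens."""
    return accuracy(ground_truth.split(), ocr_output.split())


if __name__ == "__main__":
    gt = "Von dem Wermut"    # hypothetical ground-truth line
    ocr = "Von dcm Wermut"   # hypothetical OCR output with one wrong character
    print(f"character accuracy: {char_accuracy(gt, ocr):.3f}")
    print(f"word accuracy:      {word_accuracy(gt, ocr):.3f}")
```

In this toy line a single wrong character costs one of 14 characters but one of 3 words, which is why word accuracy is always the stricter of the two figures.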

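Experimental results

Research questions

RQ1: Can neural-network-based OCR achieve high accuracy on historical printings, including 15th-century incunabula?
RQ2: How well do OCR models trained on multiple fonts generalize to unseen historical texts with different typefaces and printing dates?
RQ3: How does OCR accuracy vary across the periods and typefaces represented in the RIDGES corpus?
RQ4: Can a generalized OCR model serve as a practical and reliable first approximation for corpus construction, reducing the dependence on manual transcription?
RQ5: How does a small amount of post-correction help improve OCR accuracy and support the training of more accurate individual models?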

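Key findings

For individual documents in the RIDGES corpus, character accuracy ranges from 94% to more than 99%, and this holds for 15th-century texts as well.
Word accuracy ranges from 76% to 97%, a strong result for historical texts with irregular spelling and typography.
Mixed models trained on books with different printing dates and typefaces mostly reach character accuracies well above 90% on unseen books set in other fonts, which indicates good generalization.
Post-correcting a small number of pages and training on them makes it possible to build individual models with even higher accuracy.
The RIDGES-OCR corpus and the generalized mixed models are released under a CC-BY license, so they can be reused and improved by the community.
The study indicates that high-quality machine-readable text can be produced at scale with little manual effort, which considerably speeds up the construction of diachronic corpora.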
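
The gap between the reported character and word accuracies can be made plausible with a rough back-of-the-envelope model. The sketch below assumes that character errors are independent and that a word counts as correct only if all of its characters are correct; both the independence assumption and the average word length of 5 are illustrative choices, not figures from the paper.

```python
def expected_word_accuracy(char_accuracy, avg_word_length):
    """Probability that every character of a word is correct,
    assuming independent character errors (a simplifying assumption)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must lie in [0, 1]")
    if avg_word_length <= 0:
        raise ValueError("avg_word_length must be positive")
    return char_accuracy ** avg_word_length


if __name__ == "__main__":
    # 0.94 and 0.99 are the ends of the reported character accuracy range;
    # the word length of 5 is an assumed value for illustration only.
    for c in (0.94, 0.99):
        w = expected_word_accuracy(c, 5)
        print(f"character accuracy {c:.2f} -> estimated word accuracy {w:.2f}")
```

Under these assumptions, character accuracies of 94% and 99% correspond to word accuracies of roughly 73% and 95%, the same order of magnitude as the reported range of 76% to 97%. Real OCR errors tend to cluster, so this estimate is only a rough guide.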

This review was generated by AI and checked by a human editor.