QUICK REVIEW

[Paper Review] NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach

Torsten Hiltmann, Martin Dröge|ArXiv.org|Feb 4, 2025

Natural Language Processing Techniques|Cited by 3

One-line summary

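This paper shows that out-of-the-box LLMs with humanities-informed prompts outperform spaCy and Flair on historical NER, and that performance improves in particular when historical context and persona modelling are included. It also concludes that zero-shot prompting can outperform few-shot prompting for up to 16 shots.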

ABSTRACT

Named entity recognition (NER) is a core task for historical research in automatically establishing all references to people, places, events and the like. Yet, due to the high linguistic and genre diversity of sources, only limited canonisation of spellings, the level of required historical domain knowledge, and the scarcity of annotated training data, established approaches to natural language processing (NLP) have been both extremely expensive and yielded only unsatisfactory results in terms of recall and precision. Our paper introduces a new approach. We demonstrate how readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks, spaCy and flair, for NER in historical documents by seven to twenty-two percent higher F1-Scores. Our ablation study shows how providing historical context to the task and a bit of persona modelling that turns focus away from a purely linguistic approach are core to a successful prompting strategy. We also demonstrate that, contrary to our expectations, providing increasing numbers of examples in few-shot approaches does not improve recall or precision below a threshold of 16-shot. In consequence, our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools and instead leveraging natural language prompts and consumer-grade tools and frontends.

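Motivation and objectives

Reframe NER as a humanities task guided by domain context rather than by purely linguistic cues.
Evaluate the performance of commercial LLMs with targeted prompting on historical NER.
Compare LLM prompting strategies with the established spaCy and Flair NER pipelines on a historical corpus.
Identify how context, persona modelling, and the number of examples (few-shot versus many-shot) affect recall, precision, and F1.
Assess the effect of prompt language and the practical implications for democratising NER in humanities research.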

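Proposed method

Use a ground-truth corpus built from the 1921 edition of the Baedeker travel guide to Berlin, annotated with PER, ORG, and LOC.
Use a span-based, tag-style annotation format for the LLM output so that it can be compared against the ground-truth token spans.
Develop and apply German and English prompting schemes that include context information and persona modelling.
Run an ablation study that quantifies the effect of context, persona modelling, and zero-, few-, and many-shot prompting.
Compare the LLM output with the Flair and spaCy baselines, evaluated with fuzzy span matching and the metrics computed by nervaluate.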
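The evaluation step described above can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the inline tag syntax (`<PER>…</PER>`), the definition of a fuzzy match as "same label and at least one shared character", and the greedy one-to-one matching. It does not reproduce the paper's scripts or the nervaluate API.

```python
import re
from typing import List, Tuple

# (start, end, label) as character offsets into the untagged text
Span = Tuple[int, int, str]

# Assumed inline tag syntax; the format used in the paper may differ.
TAG = re.compile(r"<(PER|ORG|LOC)>(.*?)</\1>", re.DOTALL)


def parse_tagged(text: str) -> Tuple[str, List[Span]]:
    """Strip inline tags and return the plain text plus the entity spans."""
    plain_parts: List[str] = []
    spans: List[Span] = []
    pos = 0      # read position in the tagged text
    offset = 0   # write position in the plain text
    for m in TAG.finditer(text):
        before = text[pos:m.start()]
        plain_parts.append(before)
        offset += len(before)
        entity = m.group(2)
        spans.append((offset, offset + len(entity), m.group(1)))
        plain_parts.append(entity)
        offset += len(entity)
        pos = m.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), spans


def overlaps(a: Span, b: Span) -> bool:
    """Fuzzy match (assumed definition): same label, at least one shared character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) under greedy one-to-one fuzzy matching."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        hit = next((g for g in unmatched if overlaps(g, p)), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(pred) if pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return recall, precision, f1
```

To use it, call `parse_tagged` on both the gold annotation and the model output for the same passage, check that the two plain texts are identical, and pass the two span lists to `score`. A boundary that is slightly off (for example, a missing first name) still counts as a hit under this definition, while a wrong label does not.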

Figure 1: Main instructions in the German prompt

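Experimental results

Research questions

RQ1 Can readily available LLMs achieve high NER recall and precision on historical texts without fine-tuning?
RQ2 How do prompt context, persona modelling, and the number of shots affect NER on a historical corpus?
RQ3 Do LLMs with humanities-informed prompting outperform spaCy and Flair on historical documents?
RQ4 Does the prompt language (English vs. German) affect NER with historical prompting?
RQ5 What practical considerations, such as cost, context window, and annotation effort, arise when using LLMs for historical NER?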

Key findings

Context-Impact	lang	Recall	Precision	F1-Score
Full Prompt	de	0.84 ±0.10	0.91 ±0.08	0.87 ±0.08
Full Prompt	en	0.85 ±0.09	0.91 ±0.06	0.88 ±0.07
Specific Context	de	0.81 ±0.19	0.87 ±0.19	0.84 ±0.19
Specific Context	en	0.86 ±0.08	0.89 ±0.08	0.88 ±0.07
Generic Context	de	0.81 ±0.11	0.90 ±0.10	0.85 ±0.09
No Context	de	0.75 ±0.15	0.90 ±0.09	0.81 ±0.11
Baseline flair	--	0.76 ±0.13	0.89 ±0.10	0.81 ±0.11
Baseline spaCy	--	0.71 ±0.13	0.62 ±0.11	0.66 ±0.10
(additional prompts and context)	--	--	--	--
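
As a reading aid for the table above, the short Python sketch below recomputes F1 as the harmonic mean of the reported mean precision and recall. The selection of rows is arbitrary, and how the table's averages were aggregated is an assumption: if F1 was averaged per run, it need not equal the harmonic mean of the averaged precision and recall, so differences in the second decimal place are expected.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) means copied from the table above
rows = {
    "Full Prompt (de)": (0.84, 0.91, 0.87),
    "No Context (de)": (0.75, 0.90, 0.81),
    "Baseline flair": (0.76, 0.89, 0.81),
    "Baseline spaCy": (0.71, 0.62, 0.66),
}

for name, (recall, precision, reported) in rows.items():
    print(f"{name}: F1 from means = {f1(precision, recall):.3f}, reported = {reported:.2f}")
```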

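LLMs with humanities-informed prompting clearly outperform spaCy and Flair on historical NER in both recall and precision.
Zero-shot prompting can outperform few-shot prompting up to about 16 examples, which challenges the assumption that more examples always help.
Context information and persona modelling are core to effective prompting; with no context, or with only generic context, performance drops (in the table, recall for the German prompt falls from 0.84 with the full prompt to 0.75 with no context).
German and English prompts give comparable results when the context and persona elements are included.
The Flair baseline reaches recall 0.76, precision 0.89, and F1 0.81 on German NER; spaCy reaches recall 0.82 for LOC, 0.76 for PER, and 0.12 for ORG (overall F1 0.50).
Zero-shot, context-rich prompting reduces the need for annotated data and lowers the barrier to entry for historians.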
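
The ablation over context, persona, and number of examples can be pictured as assembling the prompt from optional parts. The Python sketch below is one hypothetical way to do that. The persona, context, and task sentences are placeholder wording invented for this illustration, not the prompt text used in the paper, and `PromptConfig` and `build_prompt` are made-up names.

```python
from dataclasses import dataclass
from typing import Tuple

# Placeholder wording, NOT the prompt text used in the paper.
PERSONA = "You are a historian specialising in early twentieth-century Berlin."
CONTEXT = "The text is an excerpt from a 1921 Baedeker travel guide to Berlin."
TASK = "Mark every person (PER), organisation (ORG) and location (LOC) in the text."


@dataclass(frozen=True)
class PromptConfig:
    persona: bool = True
    context: bool = True
    # (input, tagged output) pairs; an empty tuple means zero-shot
    examples: Tuple[Tuple[str, str], ...] = ()


def build_prompt(text: str, cfg: PromptConfig = PromptConfig()) -> str:
    """Join the enabled prompt parts, any examples, and the text to annotate."""
    parts = []
    if cfg.persona:
        parts.append(PERSONA)
    if cfg.context:
        parts.append(CONTEXT)
    parts.append(TASK)
    for source, tagged in cfg.examples:
        parts.append(f"Example input:\n{source}\nExample output:\n{tagged}")
    parts.append("Text:\n" + text)
    return "\n\n".join(parts)


# Hypothetical ablation variants: full prompt, one part removed, task only
variants = {
    "full": PromptConfig(),
    "no_persona": PromptConfig(persona=False),
    "no_context": PromptConfig(context=False),
    "task_only": PromptConfig(persona=False, context=False),
}
```

Running the same passages through each variant and scoring the outputs separately is what allows the contribution of each prompt element to be compared.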


This review was generated by AI and checked by a human editor.