QUICK REVIEW

[Paper Review] Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

Nuo Xu, Ahrii Kim | arXiv (Cornell University) | Feb 4, 2026

Natural Language Processing Techniques | Citations: 0

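ONE-LINE SUMMARY

This paper compares three tokenization paradigms (BPE, Unigram, and OBPE) across six Uralic languages. It concludes that OBPE often improves morphological alignment and cross-lingual transfer, while Unigram performs best in extremely low-resource or isolated settings.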

ABSTRACT

Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.

MOTIVATION AND OBJECTIVES

Assess how three subword tokenization paradigms affect downstream POS tagging in morphologically rich Uralic languages.
Evaluate cross-lingual transfer performance when training on a high-resource source language and fine-tuning on low-resource targets.
Determine which tokenization method yields better morphological fidelity across scripts (Latin and Cyrillic).
Analyze how resource level and genealogical proximity influence tokenization efficacy.

PROPOSED METHOD

Systematic comparison of BPE, Unigram, and Overlap-based BPE (OBPE) across six Uralic languages using UD v2 datasets.
Train tokenizers on language-specific monolingual data with a fixed vocabulary size (5,000 subword units).
Evaluate downstream POS tagging with two architectures (BiLSTM-CRF and Flair) in a cross-lingual transfer setup (source language → target language).
Use three-stage preprocessing: gold-standard extraction, greedy alignment, and first-subword tagging to project labels onto subword sequences.
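Configure OBPE with equal weights for compression and overlap (α = 0.5) and set p = −∞ in the generalized mean; in that limit the mean reduces to the minimum, so the overlap term favors tokens whose lowest frequency among the sharing languages is high.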
Report Accuracy and Macro-F1 for POS tagging as primary metrics.
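
The label-projection step in the list above (greedy alignment followed by first-subword tagging) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the function name `project_tags`, the `_` placeholder for non-initial subwords, and the toy Finnish segmentation are all hypothetical, and the sketch assumes the subword pieces concatenate exactly to each gold word with no boundary markers.

```python
from typing import List, Tuple

# Hypothetical placeholder for non-initial subwords; how the paper
# treats these positions is not stated in this review.
CONTINUATION = "_"


def project_tags(
    words: List[str], tags: List[str], subwords: List[str]
) -> List[Tuple[str, str]]:
    """Greedily align a flat subword sequence to gold words and put each
    word's POS tag on its first subword only."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    labeled: List[Tuple[str, str]] = []
    i = 0  # index of the next unconsumed subword
    for word, tag in zip(words, tags):
        consumed = ""
        first = True
        # Consume subwords until they spell out the whole gold word.
        while consumed != word:
            if i >= len(subwords) or not word.startswith(consumed + subwords[i]):
                raise ValueError(f"cannot align subwords to word {word!r}")
            consumed += subwords[i]
            labeled.append((subwords[i], tag if first else CONTINUATION))
            first = False
            i += 1
    if i != len(subwords):
        raise ValueError("trailing subwords were not aligned to any word")
    return labeled


# Toy example (segmentation invented for illustration).
print(project_tags(["taloissa", "on"], ["NOUN", "AUX"], ["talo", "i", "ssa", "on"]))
```

With this scheme the tagger is trained and scored on one label per gold word, so tokenizers that split words into different numbers of pieces remain comparable.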

EXPERIMENTAL RESULTS

Research questions

RQ1: Does OBPE yield higher morphological alignment and POS tagging accuracy than BPE and Unigram across diverse Uralic languages?
RQ2: How does cross-lingual transfer performance vary with genealogical proximity and script group (Latin vs. Cyrillic)?
RQ3: Is Unigram more effective than BPE in extremely low-resource settings due to its probabilistic segmentation and subword regularization?
RQ4: Which POS categories are most affected by tokenizer choice (open-class vs. closed-class)?

Key findings

Source	Target	Tokenizer	BiLSTM-CRF Acc	BiLSTM-CRF Macro-F1	Flair Acc	Flair Macro-F1
est	hun	BPE	0.8096	0.7013	0.9509	0.7930
est	hun	Unigram	0.7840	0.6663	0.9408	0.7651
est	hun	OBPE	0.8496	0.7398	0.9614	0.7902
est	sme	BPE	0.7749	0.7573	0.9075	0.8050
est	sme	Unigram	0.7830	0.7573	0.9078	0.7885
est	sme	OBPE	0.8152	0.7850	0.9373	0.8390
fin	hun	BPE	0.8096	0.7013	0.9509	0.7930
fin	hun	Unigram	0.7840	0.6663	0.9408	0.7651
fin	hun	OBPE	0.8514	0.7412	0.9581	0.7907
fin	sme	BPE	0.7749	0.7573	0.9075	0.8050
fin	sme	Unigram	0.7830	0.7573	0.9078	0.7885
fin	sme	OBPE	0.8036	0.7914	0.9264	0.8164
rus	kpv	BPE	0.6744	0.4742	0.3941	0.4409
rus	kpv	Unigram	0.7401	0.5209	0.9101	0.6367
rus	kpv	OBPE	0.7207	0.5022	0.8930	0.5693
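
The Accuracy and Macro-F1 columns in the table can diverge because Macro-F1 averages per-tag F1 scores with equal weight, so a rare tag counts as much as a frequent one. The sketch below shows this on toy labels (not data from the paper). Averaging over every tag that appears in either the gold or the predicted labels is one common convention and is an assumption here; the paper's exact convention is not stated in this review.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-tag F1 over tags seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the tag never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: a tagger that never predicts the rare tag.
gold = ["NOUN"] * 8 + ["INTJ"] * 2
pred = ["NOUN"] * 10
print(accuracy(gold, pred), macro_f1(gold, pred))
```

In the toy case accuracy is 0.8 while Macro-F1 drops to about 0.44, because the missed rare tag contributes an F1 of zero to the average.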

In the four Latin-script pairs, OBPE achieves the highest accuracy under both taggers and the highest Macro-F1 in most cases (BPE is marginally ahead on Flair Macro-F1 for Hungarian). In the Cyrillic pair (rus → kpv), Unigram performs best on every metric.
Hungarian shows a notable gap between Accuracy and Macro-F1 under BPE, illustrating long-tail issues where OBPE mitigates underrepresentation of rare forms.
OBPE improves open-class category tagging (e.g., North Sámi ADJ and NOUN) compared to BPE, while fixed function classes like PUNCT remain stable across tokenizers.
The Cyrillic (Russian→Komi-Zyrian) pairing exhibits a large performance gap due to typological distance and orthographic overlap, limiting OBPE’s cross-lingual gains.
Unigram generally provides more morphologically faithful segmentation in isolated low-resource settings, improving VERB and other morpheme-rich forms when data are scarce.
POS entropy (H) correlates with tokenizer performance: languages with higher entropy, that is, more even tag distributions (e.g., Hungarian and North Sámi), sustain OBPE gains under reduced training data.
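
The POS entropy mentioned in the last finding can be illustrated with a short sketch. It assumes H is the Shannon entropy of the empirical tag distribution, measured in bits; the review does not give the paper's exact definition or log base, and the function name `pos_entropy` and the toy tag lists are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def pos_entropy(tags: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical POS tag distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# An even distribution over four tags has the maximum entropy for four classes.
even = ["NOUN", "VERB", "ADJ", "ADV"]
# A distribution dominated by one tag has much lower entropy.
skewed = ["NOUN"] * 7 + ["VERB"]

print(pos_entropy(even))    # 2.0
print(pos_entropy(skewed))
```

Higher values mean the tags are spread more evenly, which is the property the finding associates with OBPE gains holding up under reduced training data.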

This review was generated by AI and checked by a human editor.