QUICK REVIEW

[Paper Review] Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

Nuo Xu, Ahrii Kim|arXiv (Cornell University)|Feb 4, 2026

Natural Language Processing Techniques|Cited by 0

ONE-SENTENCE SUMMARY

The paper compares three tokenization paradigms (BPE, Unigram, OBPE) across six Uralic languages, showing that OBPE often yields better morphological alignment and cross-lingual transfer, while Unigram performs best in extremely low-resource or isolated settings.

ABSTRACT

Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.

RESEARCH MOTIVATION AND OBJECTIVES

Assess how three subword tokenization paradigms affect downstream POS tagging in morphologically rich Uralic languages.
Evaluate cross-lingual transfer performance when training on a high-resource source language and fine-tuning on low-resource targets.
Determine which tokenization method yields better morphological fidelity across scripts (Latin and Cyrillic).
Analyze how resource level and genealogical proximity influence tokenization efficacy.

PROPOSED METHOD

Systematic comparison of BPE, Unigram, and Overlap-based BPE (OBPE) across six Uralic languages using UD v2 datasets.
Train tokenizers on language-specific monolingual data with a fixed vocabulary size (5,000 subword units).
Evaluate downstream POS tagging with two architectures (BiLSTM-CRF and Flair) in a cross-lingual transfer setup (source language -> target language).
Use a three-stage preprocessing pipeline to project labels onto subword sequences: gold-standard extraction, greedy alignment, and first-subword tagging.
Configure OBPE with equal weights for compression and overlap (α = 0.5) and with p = −∞ for the generalized mean, which maximizes the minimum shared token frequency.
Report Accuracy and Macro-F1 for POS tagging as primary metrics.
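
The label-projection step and the p = −∞ setting above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names project_labels and generalized_mean, the SentencePiece-style "▁" boundary marker, the "X" filler label for non-initial subwords, and the example segmentation are all hypothetical, and the sketch assumes each word's subword pieces concatenate back to the word exactly.

```python
import math

BOUNDARY = "\u2581"  # SentencePiece-style word-boundary marker (assumed)
IGNORE = "X"         # hypothetical filler label for non-initial subwords


def project_labels(words, tags, pieces):
    """Greedily align subword pieces to gold words and tag the first piece.

    Each word consumes pieces left to right until their concatenation
    spells the word; only the first piece of a word carries the gold tag.
    """
    labels = []
    i = 0  # index of the next unconsumed piece
    for word, tag in zip(words, tags):
        built = ""
        first = True
        while built != word:
            if i >= len(pieces):
                raise ValueError(f"pieces exhausted while aligning {word!r}")
            piece = pieces[i].replace(BOUNDARY, "")
            i += 1
            if not piece:
                # A bare boundary marker has no characters; mark it ignored.
                labels.append(IGNORE)
                continue
            built += piece
            if not word.startswith(built):
                raise ValueError(f"cannot align {built!r} with {word!r}")
            labels.append(tag if first else IGNORE)
            first = False
    if i != len(pieces):
        raise ValueError("unaligned pieces remain")
    return labels


def generalized_mean(values, p):
    """Power mean of positive values; p = -inf returns the minimum."""
    values = list(values)
    if not values:
        raise ValueError("values must be non-empty")
    if p == -math.inf:
        return min(values)
    if p == math.inf:
        return max(values)
    if p == 0:
        # Limit case: geometric mean.
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Hypothetical segmentation of a two-word Finnish sentence.
words = ["taloissa", "on"]
tags = ["NOUN", "AUX"]
pieces = [BOUNDARY + "talo", "i", "ssa", BOUNDARY + "on"]
print(project_labels(words, tags, pieces))      # ['NOUN', 'X', 'X', 'AUX']
print(generalized_mean([0.2, 0.5], -math.inf))  # 0.2
```

With p = −∞ the generalized mean of a token's per-language frequencies collapses to their minimum, so a token scores well on overlap only if it is frequent in every language. How OBPE combines this term with the compression objective through α is not reproduced here.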

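EXPERIMENTAL RESULTS

Research Questions

RQ1: Does OBPE achieve better morphological alignment and higher POS tagging accuracy than BPE and Unigram across multiple Uralic languages?
RQ2: How does cross-lingual transfer performance vary with genealogical proximity and script group (Latin vs. Cyrillic)?
RQ3: In extremely low-resource settings, is Unigram more effective than BPE thanks to its probabilistic segmentation and subword regularization?
RQ4: Which POS categories are most sensitive to tokenizer choice (open-class vs. closed-class)?

Key Findings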

Source	Target	Tokenizer	BiLSTM-CRF Acc	BiLSTM-CRF Macro-F1	Flair Acc	Flair Macro-F1
est	hun	BPE	0.8096	0.7013	0.9509	0.7930
est	hun	Unigram	0.7840	0.6663	0.9408	0.7651
est	hun	OBPE	0.8496	0.7398	0.9614	0.7902
est	sme	BPE	0.7749	0.7573	0.9075	0.8050
est	sme	Unigram	0.7830	0.7573	0.9078	0.7885
est	sme	OBPE	0.8152	0.7850	0.9373	0.8390
fin	hun	BPE	0.8096	0.7013	0.9509	0.7930
fin	hun	Unigram	0.7840	0.6663	0.9408	0.7651
fin	hun	OBPE	0.8514	0.7412	0.9581	0.7907
fin	sme	BPE	0.7749	0.7573	0.9075	0.8050
fin	sme	Unigram	0.7830	0.7573	0.9078	0.7885
fin	sme	OBPE	0.8036	0.7914	0.9264	0.8164
rus	kpv	BPE	0.6744	0.4742	0.3941	0.4409
rus	kpv	Unigram	0.7401	0.5209	0.9101	0.6367
rus	kpv	OBPE	0.7207	0.5022	0.8930	0.5693
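
To read the table above at a glance, the following Python sketch picks the highest-scoring tokenizer for every language pair and metric. The rows are copied from the table; the names ROWS, METRICS, and best_tokenizers are illustrative only.

```python
from collections import Counter

# Rows copied from the table above (whitespace-separated here).
ROWS = """\
est hun BPE 0.8096 0.7013 0.9509 0.7930
est hun Unigram 0.7840 0.6663 0.9408 0.7651
est hun OBPE 0.8496 0.7398 0.9614 0.7902
est sme BPE 0.7749 0.7573 0.9075 0.8050
est sme Unigram 0.7830 0.7573 0.9078 0.7885
est sme OBPE 0.8152 0.7850 0.9373 0.8390
fin hun BPE 0.8096 0.7013 0.9509 0.7930
fin hun Unigram 0.7840 0.6663 0.9408 0.7651
fin hun OBPE 0.8514 0.7412 0.9581 0.7907
fin sme BPE 0.7749 0.7573 0.9075 0.8050
fin sme Unigram 0.7830 0.7573 0.9078 0.7885
fin sme OBPE 0.8036 0.7914 0.9264 0.8164
rus kpv BPE 0.6744 0.4742 0.3941 0.4409
rus kpv Unigram 0.7401 0.5209 0.9101 0.6367
rus kpv OBPE 0.7207 0.5022 0.8930 0.5693
"""

METRICS = ["BiLSTM-CRF Acc", "BiLSTM-CRF Macro-F1", "Flair Acc", "Flair Macro-F1"]


def best_tokenizers(rows=ROWS):
    """Map (source, target, metric) to the highest-scoring tokenizer."""
    scores = {}
    for line in rows.strip().splitlines():
        source, target, tokenizer, *values = line.split()
        for metric, value in zip(METRICS, values):
            scores.setdefault((source, target, metric), {})[tokenizer] = float(value)
    return {key: max(by_tok, key=by_tok.get) for key, by_tok in scores.items()}


winners = best_tokenizers()
wins = Counter(winners.values())
for tokenizer, count in wins.most_common():
    print(tokenizer, count)
```

On these numbers OBPE leads in 14 of the 20 pair-metric cells, Unigram in all four rus -> kpv cells, and BPE in the Flair Macro-F1 cells for both Hungarian pairs.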

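In most language pairs, OBPE scores consistently higher than BPE and Unigram on POS tagging accuracy and Macro-F1; the exception is the Cyrillic-script group, where Unigram performs best.
Under BPE, Hungarian shows a marked gap between accuracy and Macro-F1, reflecting the long-tail problem of rare forms; OBPE mitigates their under-representation.
OBPE improves tagging of open-class categories (e.g., ADJ and NOUN in North Sámi), whereas closed functional categories such as PUNCT remain stable across tokenizers.
The Cyrillic-script pair Russian → Komi-Zyrian shows a large performance gap, attributed to typological distance and orthographic overlap, which limits OBPE's cross-lingual gains.
Unigram generally yields more morphologically faithful segmentation in isolated low-resource settings, improving performance on morphologically rich forms such as verbs when data is scarce.
POS entropy (H) correlates with tokenizer performance: languages with higher-entropy, more uniform POS distributions (e.g., Hungarian and North Sámi) retain OBPE's gains even as training data is reduced.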
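
Two quantities referred to in the findings above, the gap between accuracy and Macro-F1 and POS entropy H, can be illustrated with a short Python sketch. The toy tag sequences are made up for illustration, and the choice of base-2 logarithms and of averaging F1 over the tags present in the gold data are assumptions, since this summary does not state the paper's exact definitions.

```python
import math
from collections import Counter


def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over the tags that occur in gold."""
    f1s = []
    for tag in sorted(set(gold)):
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def pos_entropy(tags):
    """Shannon entropy (in bits) of the empirical POS-tag distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("tags must be non-empty")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy long-tail case: the rare tag is always missed.
gold = ["NOUN"] * 9 + ["INTJ"]
pred = ["NOUN"] * 10
print(accuracy(gold, pred))                          # 0.9
print(round(macro_f1(gold, pred), 3))                # 0.474
print(round(pos_entropy(gold), 3))                   # 0.469
print(pos_entropy(["NOUN", "VERB", "ADJ", "ADV"]))   # 2.0
```

Accuracy stays high because the frequent tag dominates the token count, while Macro-F1 drops because the missed rare tag contributes an F1 of zero with equal weight. The entropy values show the other effect: a skewed tag distribution has low H, and a uniform one over four tags reaches the maximum of 2 bits.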

This review was generated by AI and reviewed by a human editor.