QUICK REVIEW

[Paper Review] Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

Nuo Xu, Ahrii Kim|arXiv (Cornell University)|Feb 4, 2026
Natural Language Processing Techniques|Cited by 0
One-Sentence Summary

This paper compares three tokenization paradigms (BPE, Unigram, OBPE) across six Uralic languages. It shows that OBPE often yields better morphological alignment and cross-lingual transfer, while Unigram performs best in extremely low-resource or isolated settings.

ABSTRACT

Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.

Research Motivation and Objectives

  • Assess how three subword tokenization paradigms affect downstream POS tagging in morphologically rich Uralic languages.
  • Evaluate cross-lingual transfer performance when training on a high-resource source language and fine-tuning on low-resource targets.
  • Determine which tokenization method yields better morphological fidelity across scripts (Latin and Cyrillic).
  • Analyze how resource level and genealogical proximity influence tokenization efficacy.

Proposed Method

  • Systematic comparison of BPE, Unigram, and Overlap-based BPE (OBPE) across six Uralic languages using UD v2 datasets.
  • Train tokenizers on language-specific monolingual data with a fixed vocabulary size (5,000 subword units).
  • Evaluate downstream POS tagging with two architectures (BiLSTM-CRF and Flair) in a cross-lingual transfer setup (source language → target language).
  • Use three-stage preprocessing: gold-standard extraction, greedy alignment, and first-subword tagging to project labels onto subword sequences.
  • Tune OBPE with equal weights for compression and overlap (α = 0.5) and set p = −∞ in the generalized mean, which reduces it to the minimum and so maximizes the minimum shared-token frequency.
  • Report Accuracy and Macro-F1 for POS tagging as primary metrics.
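
The greedy alignment and first-subword tagging steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `project_labels`, the `IGNORE` placeholder label, and the assumption that subwords concatenate exactly to each surface word (no boundary markers, no normalization) are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical placeholder label for non-initial subwords (an assumption,
# not taken from the paper); such positions are typically masked in the loss.
IGNORE = "X_IGNORE"


def project_labels(words: List[str], tags: List[str], subwords: List[str],
                   ignore_label: str = IGNORE) -> List[Tuple[str, str]]:
    """Greedily align subwords to gold words and tag only the first subword.

    Assumes the subwords, concatenated in order, spell out the words exactly.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")

    projected: List[Tuple[str, str]] = []
    i = 0  # index of the next unconsumed subword
    for word, tag in zip(words, tags):
        consumed = ""
        first = True
        # Consume subwords left to right until they spell out the word.
        while consumed != word:
            if i >= len(subwords) or not word.startswith(consumed + subwords[i]):
                raise ValueError(f"cannot align subwords to word {word!r}")
            consumed += subwords[i]
            # Only the first subword of a word carries the gold POS tag.
            projected.append((subwords[i], tag if first else ignore_label))
            first = False
            i += 1
    if i != len(subwords):
        raise ValueError("unaligned subwords remain")
    return projected


# Illustrative (made-up) Finnish example: "talossa" split as "talo" + "ssa".
example = project_labels(["talossa", "on"], ["NOUN", "AUX"], ["talo", "ssa", "on"])
```

Here `example` pairs "talo" with NOUN, marks "ssa" with the placeholder label, and pairs "on" with AUX. A real pipeline would also need to handle tokenizer-specific boundary markers, which this sketch deliberately leaves out.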

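Experimental Results

Research Questions

  • RQ1: Does OBPE achieve better morphological alignment and higher POS tagging accuracy than BPE and Unigram across multiple Uralic languages?
  • RQ2: How does cross-lingual transfer performance vary with genealogical proximity and script group (Latin vs. Cyrillic)?
  • RQ3: In extremely low-resource settings, is Unigram more effective than BPE because of its probabilistic segmentation and subword regularization?
  • RQ4: Which POS categories are most sensitive to tokenizer choice (open-class vs. closed-class)?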

Key Findings

Source  Target  Tokenizer  BiLSTM-CRF Acc  BiLSTM-CRF Macro-F1  Flair Acc  Flair Macro-F1
est     hun     BPE        0.8096          0.7013               0.9509     0.7930
est     hun     Unigram    0.7840          0.6663               0.9408     0.7651
est     hun     OBPE       0.8496          0.7398               0.9614     0.7902
est     sme     BPE        0.7749          0.7573               0.9075     0.8050
est     sme     Unigram    0.7830          0.7573               0.9078     0.7885
est     sme     OBPE       0.8152          0.7850               0.9373     0.8390
fin     hun     BPE        0.8096          0.7013               0.9509     0.7930
fin     hun     Unigram    0.7840          0.6663               0.9408     0.7651
fin     hun     OBPE       0.8514          0.7412               0.9581     0.7907
fin     sme     BPE        0.7749          0.7573               0.9075     0.8050
fin     sme     Unigram    0.7830          0.7573               0.9078     0.7885
fin     sme     OBPE       0.8036          0.7914               0.9264     0.8164
rus     kpv     BPE        0.6744          0.4742               0.3941     0.4409
rus     kpv     Unigram    0.7401          0.5209               0.9101     0.6367
rus     kpv     OBPE       0.7207          0.5022               0.8930     0.5693
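
The Acc and Macro-F1 columns in the table above can diverge sharply, because accuracy is dominated by frequent tags while Macro-F1 weights every tag equally. The sketch below illustrates the two metrics on made-up data. It assumes Macro-F1 is the unweighted mean of per-tag F1 over all tags seen in the gold or predicted labels; the paper's exact averaging convention is not stated here, and the function names are hypothetical.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-tag F1 over tags in gold or pred (assumption)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive for any
        # label that occurs in gold or pred.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)


# Made-up data: a tagger that never predicts the rare tag INTJ.
gold_tags = ["NOUN"] * 8 + ["INTJ"] * 2
pred_tags = ["NOUN"] * 10
acc = accuracy(gold_tags, pred_tags)
mf1 = macro_f1(gold_tags, pred_tags)
```

In this toy case the tagger gets 8 of 10 tokens right, yet its Macro-F1 is roughly half that figure because the rare tag contributes an F1 of zero. This is the kind of long-tail effect that an accuracy versus Macro-F1 gap points to.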
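  • Across most language pairs, OBPE scores higher than BPE and Unigram in POS tagging accuracy and Macro-F1; the exception is the Cyrillic-script group, where Unigram performs best.
  • Under BPE, Hungarian shows a marked gap between accuracy and Macro-F1, reflecting the long-tail problem of rare forms; OBPE mitigates their under-representation.
  • OBPE improves tagging of open-class categories (e.g., ADJ and NOUN in Northern Sami), while fixed function categories such as PUNCT remain stable across tokenizers.
  • The Cyrillic-script pair Russian → Komi-Zyrian shows a large performance gap, attributed to typological distance and orthographic overlap, which limits OBPE's cross-lingual gains.
  • In isolated low-resource settings, Unigram generally produces more morphologically faithful segmentations, improving performance on morphologically rich forms such as verbs when data is scarce.
  • POS entropy (H) correlates with tokenizer performance: languages with higher-entropy, more uniform POS distributions (e.g., Hungarian and Northern Sami) retain OBPE's gains even when training data is reduced.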
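
The POS entropy H mentioned in the last finding is not defined in this summary. A minimal sketch, assuming it is the Shannon entropy (in bits) of the empirical tag distribution; the function name `pos_entropy` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def pos_entropy(tags: Iterable[str]) -> float:
    """Shannon entropy in bits of the empirical POS tag distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one tag")
    # H = -sum_t p(t) * log2 p(t), with p(t) estimated from tag counts.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up tag sequences: a uniform distribution versus a skewed one.
h_uniform = pos_entropy(["NOUN", "VERB", "ADJ", "ADV"] * 5)
h_skewed = pos_entropy(["NOUN"] * 7 + ["VERB"])
```

Under this definition a corpus whose tags are spread evenly over four categories has H = 2 bits, and a corpus dominated by one tag has a much lower value, which matches the sense of "higher and more uniform" used above.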


This review was generated by AI and reviewed by human editors.