QUICK REVIEW

[Paper Digest] Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark

Mehmet Bayram, Ali Arda Fincan|ArXiv.org|Feb 10, 2025

Natural Language Processing Techniques|Cited by 4

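One-Sentence Summary

The paper proposes a framework that uses Turkish as a benchmark for evaluating tokenizers on morphologically rich languages, introduces metrics such as the Turkish token percentage (TR %) and the pure token percentage (Pure %), and shows that linguistic alignment can matter more than model size alone on downstream tasks.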

ABSTRACT

Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models' (LLMs) ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for systematically evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages. Using a Turkish dataset of 6,200 multiple-choice questions from the Massive Multitask Language Understanding (MMLU) benchmark, the framework assesses tokenizers across five key metrics: vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity. These metrics provide a structured approach to evaluating how well tokenizers preserve linguistic structures. While %TR measures the proportion of valid words in the target language, %Pure assesses the alignment of tokens with meaningful linguistic units, such as roots and valid morphemes, minimizing semantic fragmentation. The findings reveal that %TR, introduced as a critical metric, exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity, emphasizing its role in improving model accuracy. Additionally, larger model parameters do not necessarily yield better tokenization quality or enhanced results, highlighting the importance of tailored tokenization strategies that prioritize linguistic alignment. This framework sets a new standard for developing robust tokenization methods optimized for morphologically complex and low-resource languages. Future work will refine morphological analysis, explore domain-specific customizations, and conduct cross-linguistic evaluations to further enhance tokenization practices.

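Motivation and Objectives

Make the case for linguistically informed tokenization in morphologically rich, low-resource languages such as Turkish.
Propose a structured evaluation framework with new metrics for assessing tokenizers.
Show how tokenization quality and linguistic alignment relate to downstream MMLU performance.
Show that larger models do not automatically yield better tokenization quality or downstream results.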

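Proposed Method

Define and apply five evaluation metrics: vocabulary size, total token count, processing time, language-specific token percentage (%TR), and token purity (%Pure).
Introduce and formalize two key metrics: %TR (the proportion of tokens that are valid Turkish words) and %Pure (the proportion of tokens that align with meaningful linguistic units such as roots and valid morphemes).
Evaluate tokenizers on the Turkish MMLU (TR-MMLU) dataset of 6,200 questions spanning 62 sections.
Compare four state-of-the-art tokenizers on Turkish data, reporting MMLU scores alongside linguistic and computational metrics.
Analyze correlations between the metrics and downstream performance, presented as a correlation matrix and multidimensional plots.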
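To make the two percentage metrics concrete, here is a minimal Python sketch. It is not the authors' implementation: the digest does not say how validity is checked or whether the percentages are taken over unique tokens, so the unique-token denominator, the function name `percent_in`, and the tiny inventories `TOY_WORDS` and `TOY_MORPHEMES` are all assumptions made for illustration. A real evaluation would use a full Turkish lexicon and a morphological analyzer.

```python
def percent_in(tokens, inventory):
    """Return the share (in %) of unique tokens that appear in `inventory`.

    Assumption: the percentage is computed over unique tokens, and tokens
    are already stripped of tokenizer-specific prefix markers.
    """
    unique = set(tokens)
    if not unique:
        return 0.0
    hits = sum(1 for tok in unique if tok in inventory)
    return 100.0 * hits / len(unique)


# Hypothetical toy inventories, for illustration only.
TOY_WORDS = {"ev", "de", "kitap"}                      # stand-in for a Turkish word list
TOY_MORPHEMES = {"ev", "kitap", "ler", "lar", "de"}    # stand-in for roots and suffixes

# Hypothetical tokenizer output for "evlerde kitaplar".
toy_tokens = ["ev", "ler", "de", "kit", "ap", "lar", "ev"]

pct_tr = percent_in(toy_tokens, TOY_WORDS)        # analogue of %TR
pct_pure = percent_in(toy_tokens, TOY_MORPHEMES)  # analogue of %Pure

print(f"%TR-like:   {pct_tr:.2f}")
print(f"%Pure-like: {pct_pure:.2f}")
```

The fragments "kit" and "ap" count against both scores, which is the kind of semantic fragmentation the metrics are meant to expose. The toy numbers say nothing about how the two metrics relate on real data.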

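Experimental Results

Research Questions

RQ1: How do tokenization strategies affect linguistic fidelity and downstream performance in Turkish?
RQ2: Do the language-specific token percentage (%TR) and token purity (%Pure) predict MMLU results better than traditional metrics such as vocabulary size or token count?
RQ3: In morphologically rich languages, does larger model size always go with better tokenization quality and downstream results?
RQ4: On Turkish NLP benchmarks, are there cases where a linguistically informed tokenizer outperforms a larger model?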

Key Findings

Model	Parameters (B)	MMLU Score (%)	Vocabulary Size	Token Count	Processing Time (s)	Unique Token Count	TR %	Pure %
gemma-2	27.2	72.10	256,000	497,015	2.95	6,383	48.63	37.05
llama-3.1	70.6	70.42	128,256	488,535	3.12	6,823	45.80	30.91
Qwen2.5	7.6	61.68	151,665	561,866	3.31	5,752	40.33	30.15
aya-expanse	32.3	70.66	255,029	434,526	2.77	8,562	50.67	32.96

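Gemma-2 achieves the highest MMLU score (72.10%) and the highest Pure % (37.05%), with a TR % of 48.63%.
Aya-expanse records the highest TR % (50.67%) and a competitive MMLU score (70.66%).
Llama-3.1 is balanced, with an MMLU score of 70.42% and a TR % of 45.80%, but a relatively low Pure % of 30.91%.
Qwen2.5 (7.6B parameters) has the lowest MMLU score (61.68%) and the lowest TR % (40.33%); it also produces the most tokens (561,866) and has the longest processing time (3.31 s), despite a vocabulary (151,665) larger than Llama-3.1's.
TR % shows the strongest correlation with MMLU (r = 0.90), followed by Pure % (r = 0.68); larger vocabulary size correlates with both TR % (r = 0.77) and Pure % (r = 0.82).
Higher token counts and longer processing times correlate negatively with the linguistic metrics (r = -0.93 and r = -0.60).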
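The correlation figures above can be checked against the four rows of the results table. The sketch below copies the table values and computes Pearson's r with a small hand-written helper. The names `ROWS` and `pearson` are illustrative, and the pairing of r = -0.93 with token count versus TR % is an assumption that the table values happen to reproduce. With only four data points, these coefficients are fragile and should be read as descriptive rather than conclusive.

```python
import math

# Values copied from the results table:
# (model, MMLU score %, token count, TR %, Pure %)
ROWS = [
    ("gemma-2",     72.10, 497_015, 48.63, 37.05),
    ("llama-3.1",   70.42, 488_535, 45.80, 30.91),
    ("Qwen2.5",     61.68, 561_866, 40.33, 30.15),
    ("aya-expanse", 70.66, 434_526, 50.67, 32.96),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


mmlu = [row[1] for row in ROWS]
token_count = [row[2] for row in ROWS]
tr_pct = [row[3] for row in ROWS]
pure_pct = [row[4] for row in ROWS]

r_tr_mmlu = pearson(tr_pct, mmlu)         # about 0.90
r_pure_mmlu = pearson(pure_pct, mmlu)     # about 0.68
r_tokens_tr = pearson(token_count, tr_pct)  # about -0.93

print(f"TR % vs MMLU:        r = {r_tr_mmlu:.2f}")
print(f"Pure % vs MMLU:      r = {r_pure_mmlu:.2f}")
print(f"Token count vs TR %: r = {r_tokens_tr:.2f}")
```

The same helper can be pointed at any other pair of table columns, such as vocabulary size against TR %, to fill in the rest of the correlation matrix.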


This digest was generated by AI and reviewed by a human editor.