QUICK REVIEW

[Paper Review] Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Sabrina J. Mielke, Zaid Alyafeai|arXiv (Cornell University)|Dec 20, 2021
Natural Language Processing Techniques|References: 174|Cited by: 103
One-Sentence Summary

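This paper surveys tokenization, pre-tokenization, and open-vocabulary modeling across word-level, subword-level, and character-level approaches, emphasizing the trade-offs and historical development, and argues that no single silver-bullet solution exists.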

ABSTRACT

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level model or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural era, by showing how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated. We conclude that there is and likely will never be a silver bullet singular solution for all applications and that thinking seriously about tokenization remains important for many applications.

Research Motivation and Objectives

  • Explain the historical development of tokens, tokenization, and pre-tokenization in NLP.
  • Survey approaches that augment word-level models with character information to handle rare and novel words.
  • Describe methods for learning segmentations and open-vocabulary tokenization beyond fixed vocabularies.
  • Discuss subword vocabulary learning, including manual, data-driven, and Bayesian approaches, and their applicability across languages.
  • Highlight the practical implications and ongoing debates about tokenization in multilingual and noisy-text contexts.

Proposed Approach

  • Trace the evolution from typographic tokens to pre-tokenization and subword units.
  • Describe methods that augment word-level models with spelling or character information to handle OOV words.
  • Detail open-vocabulary language modeling with word+character hybrids and tokenization-aware architectures.
  • Present approaches that learn segmentation as latent variables and compute marginalizations (approximate or exact).
  • Discuss Bayesian non-parametric perspectives for word discovery and segmentation.
  • Summarize subword vocabulary learning strategies, including manually crafted analyzers and data-driven learners.
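The idea of treating segmentation as a latent variable and marginalizing over it can be illustrated with a small dynamic program. The sketch below is not taken from the paper: it assumes a hypothetical unigram model in which each piece has an independent log-probability (`piece_logp`, an illustrative name), sums exactly over all segmentations with a forward pass, and recovers the single best segmentation with a Viterbi pass.

```python
import math


def marginal_log_prob(text, piece_logp, max_len=8):
    """Log of the sum, over all segmentations of `text`, of the product of piece probabilities."""
    n = len(text)
    # alpha[i] holds the log of the total score of all segmentations of text[:i].
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0
    for end in range(1, n + 1):
        terms = []
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in piece_logp and alpha[start] > -math.inf:
                terms.append(alpha[start] + piece_logp[piece])
        if terms:
            # Log-sum-exp for numerical stability.
            m = max(terms)
            alpha[end] = m + math.log(sum(math.exp(t - m) for t in terms))
    return alpha[n]


def best_segmentation(text, piece_logp, max_len=8):
    """Highest-scoring single segmentation (Viterbi), or None if no segmentation exists."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in piece_logp and best[start] > -math.inf:
                score = best[start] + piece_logp[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if n > 0 and back[n] is None:
        return None
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Hypothetical piece probabilities, chosen only for illustration.
piece_logp = {"a": math.log(0.5), "b": math.log(0.3), "ab": math.log(0.2)}
print(math.exp(marginal_log_prob("ab", piece_logp)))  # sums "a"+"b" and "ab"
print(best_segmentation("ab", piece_logp))
```

Exact marginalization is tractable here only because pieces are scored independently; when a piece's probability depends on its full history, as in many neural models, the sum generally has to be approximated, which is the trade-off the bullet above alludes to.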

Experimental Results

Research Questions

  • RQ1: What are the historical and contemporary units of text modeled in NLP, and how have tokenization definitions evolved?
  • RQ2: How can word-level models be augmented with character information to handle rare and novel words?
  • RQ3: What are the viable approaches to open-vocabulary modeling and tokenization beyond fixed vocabularies?
  • RQ4: How can segmentation be learned or inferred rather than predefined, and what are the trade-offs of different marginalization strategies?
  • RQ5: What are the strengths and limitations of subword vocabulary methods across languages and domains?

Key Findings

  • Subword and character-based tokenization methods enable open-vocabulary processing with smaller vocabularies.
  • Word-level models augmented with character information improve handling of noisy text and novel spellings.
  • Segmental and marginalization-based models can induce meaningful token boundaries but vary in training stability and performance.
  • Unsupervised and Bayesian approaches offer principled frameworks for discovering word boundaries and segments.
  • There is no single best tokenization; domain, language, and task shape the choice of units and methods.
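To make the first finding concrete, here is a minimal sketch of byte-pair encoding (BPE), the subword method named in the abstract. It is an illustrative toy, not the paper's code or any library's implementation: the function names, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for this example. It repeatedly merges the most frequent adjacent symbol pair, then applies the learned merges to segment a word that never appeared in the training counts.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus counts, chosen only for illustration.
counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(counts, num_merges=3)
print(merges)
print(segment("lowest", merges))  # "lowest" is not in the toy corpus
```

Because every word can fall back to single characters, no input is out of vocabulary, while the number of merges caps the vocabulary size. That is the trade-off between vocabulary size and sequence length the findings describe.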


This review was generated by AI and reviewed by human editors.