QUICK REVIEW

[Paper Review] Modeling Vocabulary for Big Code Machine Learning

Hlib Babii, Andrea Janes | arXiv (Cornell University) | Apr 3, 2019

Software Engineering Research | 68 references | 25 citations

TL;DR

This paper investigates the vocabulary modeling choices that matter when training large-scale neural language models (NLMs) on source code, and shows that techniques such as Byte-Pair Encoding (BPE) and strategic token filtering can reduce vocabulary size by up to three orders of magnitude. The authors train competitive NLMs on a corpus of 10,106 projects at a reported throughput of 27 projects per minute, which makes scalable pretraining and transfer learning for code intelligence tasks practical.

ABSTRACT

When building machine learning models that operate on source code, several decisions have to be made to model source-code vocabulary. These decisions can have a large impact: some can lead to not being able to train models at all, others significantly affect performance, particularly for Neural Language Models. Yet, these decisions are not often fully described. This paper lists important modeling choices for source code vocabulary, and explores their impact on the resulting vocabulary on a large-scale corpus of 14,436 projects. We show that a subset of decisions have decisive characteristics, allowing to train accurate Neural Language Models quickly on a large corpus of 10,106 projects.

Motivation & Objective

Address the challenge of managing extremely large and diverse source code vocabularies in machine learning models.
Investigate how different vocabulary modeling decisions impact vocabulary size, out-of-vocabulary (OOV) rates, and training feasibility.
Enable scalable training of neural language models (NLMs) on large-scale code corpora by identifying optimal configuration choices.
Demonstrate that with proper vocabulary modeling, NLMs can be trained efficiently and achieve competitive performance on code completion and language modeling tasks.

Proposed method

Analyze 14,436 open-source projects to evaluate the impact of vocabulary modeling choices on vocabulary size, token count, and OOV rate.
Apply Byte-Pair Encoding (BPE) to subword-encode identifiers and reduce vocabulary size while preserving semantic meaning.
Use heuristic filtering to exclude non-English source files and rare literals, reducing noise and vocabulary bloat.
Train and evaluate multiple NLMs using LSTM and QRNN architectures on a large-scale corpus of 10,106 projects.
Optimize training efficiency by maintaining a fixed vocabulary, enabling linear scaling with data size and faster fine-tuning.
Evaluate models on both language modeling and code completion tasks to assess performance and generalization.
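The BPE step listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (learn_bpe, merge_pair, encode), the end-of-word marker "</w>", the tie-breaking rule, and the toy identifier list are all assumptions made for illustration. The sketch learns merge rules by repeatedly fusing the most frequent adjacent symbol pair, then applies those rules to segment an unseen identifier.

```python
from collections import Counter

END = "</w>"  # assumed end-of-word marker, a common BPE convention


def merge_pair(symbols, pair):
    """Fuse every non-overlapping occurrence of `pair` in a symbol tuple."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of tokens."""
    # Each distinct word starts as a tuple of characters plus the marker.
    vocab = Counter(tuple(w) + (END,) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair; ties broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["getName", "getValue", "setName", "setValue"]  # toy data
    rules = learn_bpe(corpus, num_merges=10)
    print(encode("getValueName", rules))  # unseen identifier, split into known subwords
```

Because any identifier can be spelled from the learned subwords (falling back to single characters), the vocabulary stays bounded by the number of merges plus the base alphabet, which is the property the review attributes to the fixed-vocabulary setup.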

Experimental results

Research questions

RQ1: How do different vocabulary modeling choices, such as token filtering, case handling, and subword segmentation, affect vocabulary size and OOV rates in source code?
RQ2: What combination of modeling choices enables training of neural language models on large-scale code corpora (e.g., 10,000+ projects) without vocabulary explosion?
RQ3: To what extent can BPE and other subword techniques reduce vocabulary size while maintaining model performance?
RQ4: How does a fixed, controlled vocabulary impact training speed and scalability of NLMs on big code datasets?
RQ5: Can NLMs trained on such large corpora achieve competitive performance on downstream tasks like code completion?

Key findings

Vocabulary modeling choices can influence vocabulary size by up to three orders of magnitude, with the most impactful decisions being subword segmentation and filtering of rare or non-English tokens.
BPE-based subword tokenization is essential for controlling vocabulary size; simple heuristics like case splitting are insufficient for large-scale training.
The authors successfully trained a neural language model on a corpus of 10,106 projects in under a day, achieving a training rate of 27 projects per minute and 50 source code files per second.
The resulting model achieved competitive performance on both language modeling and code completion tasks, demonstrating the feasibility of large-scale pretraining on code.
A fixed vocabulary lets training time scale linearly with data size and allows fast fine-tuning, potentially in minutes, which makes transfer learning practical for code intelligence applications.
The study concludes that advanced techniques such as BPE and strategic filtering are necessary to avoid OOV issues and to make large-scale NLMs on code feasible.
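The two quantities these findings revolve around, vocabulary size and OOV rate, can be measured with a short sketch. Everything here is an illustrative assumption rather than the paper's tooling: the helper names (split_identifier, vocab_and_oov), the regular expression used for camelCase and snake_case splitting, the lowercasing of subtokens, and the toy token lists. The OOV rate is taken as the fraction of test tokens that never appear in the training vocabulary.

```python
import re

# Assumed splitting rule: acronyms, capitalized or lowercase words, digit runs.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(token):
    """Split on underscores and camelCase boundaries, then lowercase."""
    parts = []
    for chunk in token.split("_"):
        parts.extend(_PART.findall(chunk))
    # Tokens with no letters or digits (operators, brackets) are kept whole.
    return [p.lower() for p in parts] or [token]


def vocab_and_oov(train_tokens, test_tokens, split=False):
    """Return (vocabulary size, OOV rate of the test tokens)."""
    if split:
        train_tokens = [p for t in train_tokens for p in split_identifier(t)]
        test_tokens = [p for t in test_tokens for p in split_identifier(t)]
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return len(vocab), unseen / len(test_tokens)


if __name__ == "__main__":
    # Toy data, not drawn from the paper's corpus.
    train = ["getUserName", "setUserName", "getUserId", "setUserId",
             "getName", "setName", "user_name"]
    test = ["getId", "user_id", "setName"]
    print(vocab_and_oov(train, test))              # whole identifiers
    print(vocab_and_oov(train, test, split=True))  # split subtokens
```

On this toy data the unsplit vocabulary has 7 entries and 2 of the 3 test tokens are unseen, while the split vocabulary has 5 entries and no unseen test tokens. A corpus this small says nothing about the magnitude of the effect at scale; the review's point is that heuristics of this kind help but are not sufficient on their own for large corpora.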

This review was created by AI and reviewed by human editors.