[Paper Review] A parallel corpus of Python functions and documentation strings for automated code documentation and code generation
This paper presents a large parallel corpus of Python functions with their docstrings scraped from GitHub, plus baselines for code documentation and code generation using neural MT and data augmentation, and releases these resources for research.
Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. In this work we introduce a large and diverse parallel corpus of over a hundred thousand Python functions with their documentation strings ("docstrings") generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data. We release our datasets and processing scripts in order to stimulate research in these areas.
Motivation & Objective
- Motivate the need for large, realistic code-language corpora to advance automated code documentation and generation.
- Create a diverse parallel corpus of Python functions with docstrings from GitHub and accompanying metadata.
- Provide baseline neural machine translation results for code-to-docstring and docstring-to-code tasks on the corpus.
- Explore data augmentation via synthetic docstrings to enhance training data.
- Release datasets, preprocessing scripts, and baseline configurations to the research community.
Proposed method
- Scrape GitHub to extract Python 2.7 code, splitting into function declarations, docstrings, and bodies.
- Filter and preprocess to produce parallel corpora and a separate code-only corpus with synthetic docstrings.
- Tokenize with Moses and apply Byte-Pair Encoding to reduce sparsity.
- Train neural machine translation models in both directions (code-to-docstring and docstring-to-code) using Nematus with specific hyperparameters.
- Apply backtranslation by generating synthetic docstrings on code-only data and retraining.
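The extraction step in the list above can be illustrated with a minimal sketch. This is not the authors' released script: it assumes Python 3.8+ and the standard `ast` module, so it only parses Python 3 syntax, whereas the corpus was built from Python 2.7 code. The function name `extract_triples` and the sample source are hypothetical, and the sketch assumes each docstring starts on its own line after the `def` line.

```python
import ast
import textwrap


def extract_triples(source):
    """Return (declaration, docstring, body) triples for documented functions.

    Illustrative only: assumes Python 3.8+ syntax and line-based slicing.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    triples = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        if not docstring or len(node.body) < 2:
            continue  # skip undocumented functions and docstring-only stubs
        doc_stmt = node.body[0]  # the docstring expression statement
        if doc_stmt.lineno <= node.lineno:
            continue  # assumption: the docstring starts on its own line
        # Declaration: the `def` line(s) up to the line before the docstring.
        declaration = "\n".join(lines[node.lineno - 1:doc_stmt.lineno - 1]).strip()
        # Body: every line after the docstring, up to the end of the function.
        body = textwrap.dedent("\n".join(lines[doc_stmt.end_lineno:node.end_lineno]))
        triples.append((declaration, docstring, body.strip("\n")))
    return triples


sample = '''
def add(a, b):
    """Return the sum of a and b."""
    total = a + b
    return total

def undocumented(x):
    return x
'''

for declaration, docstring, body in extract_triples(sample):
    print(declaration)
    print(docstring)
    print(body)
```

Only `add` yields a triple here; `undocumented` is skipped because it has no docstring. In this sketch, nested functions are emitted as separate triples, and they also remain part of the enclosing function's body.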
Experimental results
Research questions
- RQ1: Can a large, diverse Python function-docstring corpus support effective learning for code documentation and code generation tasks?
- RQ2: How do neural MT baselines perform on code-to-docstring and docstring-to-code tasks on this data?
- RQ3: Do backtranslation and synthetic docstring augmentation improve performance?
- RQ4: What do the baseline BLEU scores indicate about task difficulty on this corpus?
Key findings
- The main parallel corpus contains 150,370 triples of function declaration, docstring, and body; the reported split has 109,108 training, 2,000 validation, and 2,000 test examples.
- Code-to-docstring baseline BLEU: 14.03 (valid) and 13.84 (test).
- Docstring-to-code baseline BLEU (base): 10.32 (valid) and 10.24 (test).
- Docstring-to-code with backtranslation BLEU: 10.85 (valid) and 10.90 (test).
- Backtranslation provides a moderate improvement over the base docstring-to-code model: +0.53 BLEU on validation and +0.66 BLEU on test.
- BLEU scores on this corpus are far lower than those reported on previously published Python corpora (in the 60–85 range), which suggests the dataset is more challenging and closer to the complexity of real-world code.
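As a quick check on the backtranslation gain, the arithmetic can be reproduced from the docstring-to-code scores listed above. The helper name `bleu_gains` is hypothetical, and the relative-improvement figure is derived here rather than reported in the review.

```python
# Docstring-to-code BLEU scores as listed in the key findings above.
BASELINE = {"valid": 10.32, "test": 10.24}
BACKTRANSLATION = {"valid": 10.85, "test": 10.90}


def bleu_gains(base, augmented):
    """Return absolute BLEU gains per split (augmented minus base)."""
    return {split: augmented[split] - base[split] for split in base}


gains = bleu_gains(BASELINE, BACKTRANSLATION)
for split, gain in gains.items():
    relative = 100 * gain / BASELINE[split]
    print(f"{split}: +{gain:.2f} BLEU ({relative:.1f}% relative)")
```

The absolute gains come out to 0.53 (validation) and 0.66 (test), roughly 5 to 6 percent relative to the baseline.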
This review was created by AI and reviewed by human editors.