[Paper Review] A deep tree-based model for software defect prediction
The paper introduces a tree-structured LSTM that directly operates on Abstract Syntax Trees to learn code representations for defect prediction, with strong within-project and cross-project performance on Samsung and PROMISE datasets.
Defects are common in software systems and can potentially cause various problems to software users. Different methods have been developed to quickly predict the most likely locations of defects in large code bases. Most of them focus on designing features (e.g. complexity metrics) that correlate with potentially defective code. Those approaches however do not sufficiently capture the syntax and different levels of semantics of source code, an important capability for building accurate prediction models. In this paper, we develop a novel prediction model which is capable of automatically learning features for representing source code and using them for defect prediction. Our prediction system is built upon the powerful deep learning, tree-structured Long Short Term Memory network which directly matches with the Abstract Syntax Tree representation of source code. An evaluation on two datasets, one from open source projects contributed by Samsung and the other from the public PROMISE repository, demonstrates the effectiveness of our approach for both within-project and cross-project predictions.
Motivation & Objective
- Motivate defect prediction as essential for prioritizing testing and maintenance in large codebases.
- Propose a deep tree-structured LSTM that matches the AST of source code to preserve syntax and semantics.
- Eliminate manual feature engineering by learning representations directly from ASTs.
- Evaluate the approach on real open-source Samsung projects and the PROMISE repository for both within- and cross-project prediction.
Proposed method
- Parse source files into Abstract Syntax Trees (ASTs).
- Embed AST node labels into fixed-size vectors via an embedding matrix (ast2vec).
- Apply a tree-structured LSTM (Tree-LSTM) that aggregates child representations to produce a root code representation.
- Train the Tree-LSTM in an unsupervised manner by predicting a parent node label from its children (softmax over parent labels).
- Use the learned root representation as input to a conventional classifier (logistic regression or random forest) for defect prediction.
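The bottom-up aggregation in the steps above can be sketched in plain Python. This is only an illustrative sketch under several assumptions that are not taken from the paper: it uses Python's built-in `ast` module as a stand-in parser, a child-sum Tree-LSTM cell, a tiny hidden size (`DIM`), and random untrained weights. The unsupervised parent-label training step and the downstream classifier are left out; the names `embed`, `gate`, `tree_lstm`, and `code_vector` are hypothetical helpers.

```python
import ast
import math
import random

random.seed(0)
DIM = 8  # hypothetical embedding / hidden size, chosen only for illustration


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(*vecs):
    # Element-wise sum of one or more equally sized vectors.
    return [sum(xs) for xs in zip(*vecs)]


# One (W, U, b) triple per gate: input (i), forget (f), output (o), candidate (u).
# Weights are random and untrained; a real model would learn them.
PARAMS = {g: (rand_matrix(DIM, DIM), rand_matrix(DIM, DIM), [0.0] * DIM) for g in "ifou"}
EMBED = {}  # node label -> vector, filled lazily (stand-in for an embedding matrix)


def embed(label):
    if label not in EMBED:
        EMBED[label] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return EMBED[label]


def gate(name, x, h, act):
    W, U, b = PARAMS[name]
    return [act(z) for z in add(matvec(W, x), matvec(U, h), b)]


def tree_lstm(node):
    """Return (h, c) for an AST node, computed bottom-up from its children."""
    x = embed(type(node).__name__)  # the node label is the AST node type
    children = [tree_lstm(ch) for ch in ast.iter_child_nodes(node)]
    h_sum = add(*[h for h, _ in children]) if children else [0.0] * DIM
    i = gate("i", x, h_sum, sigmoid)
    o = gate("o", x, h_sum, sigmoid)
    u = gate("u", x, h_sum, math.tanh)
    c = [iv * uv for iv, uv in zip(i, u)]
    for h_k, c_k in children:
        f_k = gate("f", x, h_k, sigmoid)  # one forget gate per child
        c = [cv + fv * ck for cv, fv, ck in zip(c, f_k, c_k)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def code_vector(source):
    """Hidden state of the root node, used as the representation of the file."""
    h, _ = tree_lstm(ast.parse(source))
    return h
```

Calling `code_vector("def f(a, b):\n    return a + b\n")` returns a list of `DIM` floats. In the pipeline described above, a vector like this would be passed to a classifier such as logistic regression or a random forest. Because this sketch embeds only node types, two snippets that differ only in identifiers or literal values map to the same vector.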
Experimental results
Research questions
- RQ1: Can a tree-structured LSTM over ASTs effectively capture syntactic and semantic information for defect prediction?
- RQ2: Does AST-based representation learning improve within-project and cross-project defect prediction compared to traditional feature-based approaches?
- RQ3: What is the impact of classifier choice (Logistic Regression vs Random Forest) on prediction performance across datasets?
Key findings
- Within the Samsung dataset, Random Forests with the Tree-LSTM features achieve F-measure, Precision, Recall, and AUC all well above 0.9, with AUC ≈ 0.98.
- Within the PROMISE dataset, the approach yields an average AUC of 0.60 and a high recall of 0.86, though precision and F-measure are lower than some baselines.
- Cross-project prediction shows high average recall (≈0.8) across 22 project pairs, with AUC consistently above 0.5, i.e. better than random guessing.
- The study demonstrates the model’s ability to learn from raw ASTs, enabling defect prediction without manual feature engineering and providing potential localization via attention.
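For readers less familiar with the metrics quoted above, the following is a minimal sketch of how precision, recall, F-measure, and AUC can be computed for binary defect labels (1 = defective, 0 = clean). The function names are hypothetical, and the pairwise AUC formulation is one standard way to compute it, not necessarily the one used in the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F-measure for binary labels (1 = defective)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """Probability that a random defective file outscores a random clean one.

    Ties between a defective and a clean file count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration only; these are not results from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.35 else 0 for s in scores]  # hypothetical decision threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
```

On this toy data, every defective file is flagged (recall 1.0) at the cost of two false alarms (precision 0.6), giving an F-measure of about 0.75 and an AUC of about 0.8. This is the same kind of trade-off as the high-recall, lower-precision pattern reported for PROMISE.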
This review was created by AI and reviewed by human editors.