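[Paper Review] A deep tree-based model for software defect prediction
This paper introduces a tree-structured LSTM that operates directly on Abstract Syntax Trees to learn code representations for defect prediction, and reports strong within-project and cross-project performance on the Samsung and PROMISE datasets.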
Defects are common in software systems and can potentially cause various problems to software users. Different methods have been developed to quickly predict the most likely locations of defects in large code bases. Most of them focus on designing features (e.g. complexity metrics) that correlate with potentially defective code. Those approaches however do not sufficiently capture the syntax and different levels of semantics of source code, an important capability for building accurate prediction models. In this paper, we develop a novel prediction model which is capable of automatically learning features for representing source code and using them for defect prediction. Our prediction system is built upon the powerful deep learning, tree-structured Long Short Term Memory network which directly matches with the Abstract Syntax Tree representation of source code. An evaluation on two datasets, one from open source projects contributed by Samsung and the other from the public PROMISE repository, demonstrates the effectiveness of our approach for both within-project and cross-project predictions.
Research Motivation and Goals
- Motivate defect prediction as essential for prioritizing testing and maintenance in large codebases.
- Propose a deep tree-structured LSTM that matches the AST of source code to preserve syntax and semantics.
- Eliminate manual feature engineering by learning representations directly from ASTs.
- Evaluate the approach on real open-source Samsung projects and the PROMISE repository for both within- and cross-project prediction.
Proposed Method
- Parse source files into Abstract Syntax Trees (ASTs).
- Embed AST node labels into fixed-size vectors via an embedding matrix (ast2vec).
- Apply a tree-structured LSTM (Tree-LSTM) that aggregates child representations to produce a root code representation.
- Train the Tree-LSTM in an unsupervised manner by predicting a parent node label from its children (softmax over parent labels).
- Use the learned root representation as input to a conventional classifier (logistic regression or random forest) for defect prediction.
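The forward pass described in the steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: Python's built-in `ast` module stands in for the parser the paper applies to its projects, the cell follows the common Child-Sum Tree-LSTM formulation (which may differ from the paper's exact unit), bias terms are omitted, weights are random rather than trained, and the names `ChildSumTreeLSTM`, `DIM`, `embed`, and `encode` are hypothetical. The unsupervised parent-label training and the downstream classifier are not shown.

```python
import ast
import math
import random

DIM = 8  # hypothetical embedding / hidden size, chosen only for illustration


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ChildSumTreeLSTM:
    """Untrained Child-Sum Tree-LSTM over Python ASTs (illustrative assumption)."""

    def __init__(self, dim=DIM, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.embeddings = {}  # node label -> vector, created lazily
        # One (W, U) weight pair per gate: input, forget, output, candidate.
        self.W = {g: rand_matrix(self.rng, dim, dim) for g in "ifou"}
        self.U = {g: rand_matrix(self.rng, dim, dim) for g in "ifou"}

    def embed(self, label):
        """Look up (or create) the fixed-size vector for an AST node label."""
        if label not in self.embeddings:
            self.embeddings[label] = [
                self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)
            ]
        return self.embeddings[label]

    def gate(self, name, x, h, squash):
        wx = matvec(self.W[name], x)
        uh = matvec(self.U[name], h)
        return [squash(a + b) for a, b in zip(wx, uh)]

    def encode(self, node):
        """Return (hidden, cell) state for the subtree rooted at `node`."""
        x = self.embed(type(node).__name__)
        children = [self.encode(child) for child in ast.iter_child_nodes(node)]
        # Sum of child hidden states; all zeros for a leaf node.
        h_sum = [sum(h[k] for h, _ in children) for k in range(self.dim)]
        i = self.gate("i", x, h_sum, sigmoid)
        o = self.gate("o", x, h_sum, sigmoid)
        u = self.gate("u", x, h_sum, math.tanh)
        c = [ik * uk for ik, uk in zip(i, u)]
        for child_h, child_c in children:
            f = self.gate("f", x, child_h, sigmoid)  # one forget gate per child
            c = [ck + fk * cc for ck, fk, cc in zip(c, f, child_c)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c


source = "def add(a, b):\n    return a + b\n"
model = ChildSumTreeLSTM()
root_vector, _ = model.encode(ast.parse(source))
print(len(root_vector))  # length equals DIM
```

The root vector has the same length regardless of how large the file's AST is, which is what lets a conventional classifier consume it as a feature vector.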
Experimental Results
Research Questions
- RQ1: Can a tree-structured LSTM over ASTs effectively capture syntactic and semantic information for defect prediction?
- RQ2: Does AST-based representation learning improve within-project and cross-project defect prediction compared to traditional feature-based approaches?
- RQ3: What is the impact of classifier choice (Logistic Regression vs. Random Forest) on prediction performance across datasets?
Key Findings
- On the Samsung dataset, Random Forests trained on the Tree-LSTM features achieve F-measure, Precision, Recall, and AUC all well above 0.9, with AUC ≈ 0.98.
- On the PROMISE dataset, the approach yields an average AUC of 0.60 and a high recall of 0.86, though precision and F-measure are lower than some baselines.
- Cross-project prediction shows high average recall (≈0.8) across 22 project pairs, with AUC consistently above 0.5, i.e. better than random ranking.
- The study demonstrates the model’s ability to learn from raw ASTs, enabling defect prediction without manual feature engineering and providing potential localization via attention.
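For readers less familiar with the metrics quoted above, the sketch below shows how Precision, Recall, F-measure, and AUC are typically computed for a binary defect classifier. The labels, scores, and the 0.5 threshold are made-up toy values, not data from the paper, and the helper names `precision_recall_f1` and `auc` are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F-measure for binary labels (1 = defective)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """AUC as the probability that a defective file outranks a clean one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 4 defective and 4 clean files with hypothetical scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
print(precision, recall, f1, area)
```

An AUC of 0.5 corresponds to ranking files at random, which is why values above 0.5 are read as evidence that the model carries some signal, and why high recall paired with lower precision means most defective files are found at the cost of more false alarms.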
This review was generated by AI and reviewed by human editors.