QUICK REVIEW

[Paper Review] Early Prediction for Merged vs Abandoned Code Changes in Modern Code Reviews

Md. Khairul Islam, Toufique Ahmed|arXiv (Cornell University)|Dec 6, 2019

Software Engineering Research | References: 59 | Cited by: 22

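Summary in Brief

The paper proposes PredCR, a LightGBM-based machine learning model that predicts, at an early stage, whether a code change in modern code review will be merged or abandoned. The model is trained on 146,612 code changes from three open-source projects and uses 25 features across five dimensions (reviewer, author, project, text, and code). It reaches an average AUC of about 85%, a relative improvement of 14-23% over prior work, and reduces bias against new contributors.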

ABSTRACT

The modern code review process is an integral part of current software development practice. Considerable effort is spent here to inspect code changes, find defects, suggest improvements, and address the suggestions of the reviewers. In a code review process, several iterations usually take place in which an author submits code changes and a reviewer gives feedback until the reviewer is happy to accept the change. In around 12% of cases, the changes are abandoned, eventually wasting all of that effort. In this research, our objective is to design a tool that can predict at an early stage whether a code change will be merged or abandoned, to reduce the wasted effort of all stakeholders involved (e.g., program author, reviewer, project management, etc.). The real-world demand for such a tool was formally identified in a study by Fan et al. [1]. We have mined 146,612 code changes from the code reviews of three large and popular open-source software projects and trained and tested a suite of supervised machine learning classifiers, both shallow and deep learning based. We consider a total of 25 features for each code change during the training and testing of the models. The best performing model, named PredCR (Predicting Code Review), is a LightGBM-based classifier that achieves around 85% AUC on average and improves on the state of the art [1] by a relative 14-23%. In our empirical study on the 146,612 code changes from the three software projects, we find that (1) the new features introduced in PredCR, such as the reviewer dimension, are the most informative; (2) compared to the baseline, PredCR is more effective at reducing bias against new developers; and (3) PredCR uses historical data in the code review repository, and as such its performance improves as a software system evolves with new and more data.

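Research Motivation and Goals

Reduce wasted effort in code review, where about 12% of changes are still abandoned after multiple iterations.
Build an early prediction tool that identifies whether a code change is likely to be merged or abandoned before the full review cycle ends.
Address limitations of prior work, including predictions made too late, missing historical context, and bias against new contributors.
Improve prediction accuracy by combining reviewer, author, project, text, and code features in a single model.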

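Proposed Method

The authors mined 146,612 code changes from the Gerrit-based code reviews of three large open-source projects (Eclipse, LibreOffice, GerritHub).
They extracted 25 features across five dimensions: reviewer (e.g., reviewer experience, number of past reviews), author (e.g., contribution history), project (e.g., project size, review frequency), text (e.g., patch message length, sentiment analysis), and code (e.g., lines modified, complexity).
Using supervised learning, they trained six classifiers: five shallow models (logistic regression, SVM, random forest, XGBoost, LightGBM) and one deep neural network.
Models were trained and tested with longitudinal cross-validation, which prevents data leakage and keeps the early-prediction setting realistic.
Class imbalance was handled with a balanced classification loss, and hyperparameters were tuned to optimize performance.
Based on AUC and other metrics, the LightGBM-based PredCR was selected as the best-performing model.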
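
The longitudinal cross-validation and class balancing steps listed above can be sketched as follows. This is a minimal, standard-library illustration and not the authors' code: the function names `longitudinal_folds` and `balanced_class_weights`, the fold count, and the toy label list are all assumptions made for this example. The weighting formula is the common "balanced" heuristic (`n_samples / (n_classes * class_count)`); the summary does not state which exact formula the paper uses.

```python
from collections import Counter
from typing import Dict, Iterator, List, Sequence, Tuple


def longitudinal_folds(
    n_items: int, n_folds: int = 10
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for items already sorted by creation time.

    The items are cut into n_folds + 1 consecutive chunks. Fold k trains on
    chunks 0..k-1 and tests on chunk k, so every test item is newer than
    every training item and no future information leaks into training.
    """
    if n_folds < 1 or n_items < n_folds + 1:
        raise ValueError("need at least n_folds + 1 items")
    chunk = n_items // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * chunk
        # The last fold absorbs any remainder left by integer division.
        test_end = n_items if k == n_folds else (k + 1) * chunk
        yield list(range(train_end)), list(range(train_end, test_end))


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class inversely to its frequency (assumed heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


if __name__ == "__main__":
    # Toy labels only: 1 = merged, 0 = abandoned, with 12% abandoned.
    toy_labels = [1] * 88 + [0] * 12
    print(balanced_class_weights(toy_labels))
    for train_idx, test_idx in longitudinal_folds(len(toy_labels), n_folds=4):
        print(len(train_idx), "train items ->", len(test_idx), "test items")
```

In a real pipeline, each fold's training indices would be used to fit a classifier and the test indices to score it; the class weights would be passed to the classifier's loss so that the rarer abandoned class is not ignored.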

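Experimental Results

Research Questions

RQ1: Can PredCR outperform the state-of-the-art baseline (Fan et al. [1]) in predicting merged versus abandoned code changes?
RQ2: How effective is each feature dimension (reviewer, author, project, text, code) at predicting the outcome of a code change?
RQ3: Does PredCR mitigate bias against new developers who have limited contribution history?
RQ4: How does model performance evolve as the system accumulates more historical data over time?
RQ5: How robust is PredCR when applied to new projects or subprojects with limited training data?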

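Key Findings

PredCR achieves an average AUC of about 85%, a relative improvement of 14-23% over the state-of-the-art baseline (Fan et al. [1]).
The reviewer dimension is the most informative feature set and contributes most to prediction performance, followed by author, project, code, and text features.
Compared with the baseline, PredCR reduces bias against new developers and still performs well when the author's history is limited.
Model performance improves over time as more historical data accumulates, indicating that PredCR benefits from an evolving code review repository.
When retrained on a single project and applied to other projects, PredCR still performs well, suggesting good transferability and generalization across projects.
The model remains effective even when predictions are updated after each revision, showing robustness in dynamic review workflows.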
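
To make the two headline quantities concrete, the sketch below shows one way to compute AUC and a relative improvement. It is an illustration under stated assumptions, not the paper's evaluation code: `roc_auc` and `relative_improvement` are hypothetical helper names, and the labels, scores, and AUC values in the usage section are made-up toy numbers that do not come from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is O(P * N) and is meant for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_improvement(new: float, baseline: float) -> float:
    """Improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Toy data: 1 = merged, 0 = abandoned; scores are predicted merge probabilities.
    toy_labels = [1, 1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
    print("AUC:", roc_auc(toy_labels, toy_scores))
    # Hypothetical AUC values, chosen only to show the arithmetic.
    print("relative improvement:", relative_improvement(0.75, 0.60))
```

Note that a relative improvement is measured against the baseline's own score, so it is not the same as a difference in percentage points between two AUC values.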


This review was generated by AI and checked by human editors.