QUICK REVIEW

[Paper Review] SourcererCC and SourcererCC-I: Tools to Detect Clones in Batch mode and During Software Development

Vaibhav Saini, Hitesh Sajnani|arXiv (Cornell University)|Mar 5, 2016

Software Engineering Research|References: 24|Cited by: 24

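ONE-SENTENCE SUMMARY

SourcererCC is a scalable token-based clone detector that uses an optimized inverted index and token-ordering heuristics to efficiently detect Type 1–3 clones in large codebases on a standard workstation. SourcererCC-I is an Eclipse plug-in that supports real-time, incremental clone detection during software development, delivering high performance and accuracy without distributed infrastructure.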

ABSTRACT

Given the availability of large source-code repositories, there has been a large number of applications for large-scale clone detection. Unfortunately, despite a decade of active research, there is a marked lack in clone detectors that scale to big software systems or large repositories, specifically for detecting near-miss (Type 3) clones where significant editing activities may take place in the cloned code. This paper demonstrates: (i) SourcererCC, a token-based clone detector that targets the first three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. It uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone, and (ii) SourcererCC-I, an Eclipse plug-in, that uses SourcererCC's core engine to identify and navigate clones (both inter and intra project) in real-time during software development. In our experiments, comparing SourcererCC with the state-of-the-art tools, we found that it is the only clone detection tool to successfully scale to 250 MLOC on a standard workstation with 12 GB RAM and efficiently detect the first three types of clones (precision 86% and recall 86-100%). Link to the demo: https://youtu.be/l7F_9Qp-ks4

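MOTIVATION AND OBJECTIVES

Address the shortcomings of existing clone detectors, which lack scalability on large software systems, fall short on accuracy, and do not support near-miss (Type 3) clones.
Achieve efficient clone detection both in batch mode and during active software development, without distributed computing.
Provide a programming-language-independent solution that works on inter-project and intra-project codebases with very low resource consumption.
Support real-time clone detection inside the integrated development environment (IDE), helping developers improve maintainability and reduce the risk of code duplication while they code.
Maintain high precision and recall for Type 1–3 clone detection in large repositories using only a standard workstation.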

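PROPOSED METHOD

Build a partial inverted index using the sub-block overlap filtering heuristic, indexing only a subset of each code block's tokens to reduce index size and query cost.
Apply token position filtering, which uses token order to compute dynamic upper and lower bounds on the similarity score so that candidate clones can be rejected or accepted early.
Use a token-based approach with language-aware tokenization (currently Java, C, and C#), supporting multiple programming languages without abstract syntax tree (AST) parsing.
Integrate the core SourcererCC engine into an Eclipse plug-in (SourcererCC-I) for real-time clone detection during development.
Support incremental index updates so that changed code can be re-indexed quickly without full reprocessing.
Organize detected clones hierarchically by project and file for intuitive navigation in the IDE.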
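
The two filtering heuristics listed above can be illustrated with a small sketch. It is not the tool's actual code: the function names, the whitespace tokenization, the 0.8 threshold, and the similarity rule (two blocks count as clones when their multiset token overlap reaches `ceil(theta * max(len(a), len(b)))`) are assumptions made for illustration. Tokens are sorted rarest-first, only a short prefix of each block is indexed (sub-block overlap filtering), and verification stops as soon as the best achievable overlap falls below the threshold (token position filtering).

```python
import math
from collections import Counter, defaultdict


def overlap(a, b):
    """Brute-force multiset overlap, used only as a reference."""
    return sum((Counter(a) & Counter(b)).values())


def global_order(blocks):
    """Sort key that puts globally rare tokens first (ties broken by token text)."""
    freq = Counter(tok for toks in blocks.values() for tok in toks)
    return lambda tok: (freq[tok], tok)


def prefix_len(n, theta):
    """Size of the sub-block that has to be indexed for a block of n tokens."""
    return n - math.ceil(theta * n) + 1


def build_partial_index(blocks, theta, key):
    """Index only the prefix (rarest tokens) of every block."""
    index, ordered = defaultdict(set), {}
    for bid, toks in blocks.items():
        ordered[bid] = sorted(toks, key=key)
        for tok in ordered[bid][:prefix_len(len(toks), theta)]:
            index[tok].add(bid)
    return index, ordered


def verify(a, b, theta, key):
    """Merge two sorted token lists; reject early via an upper bound."""
    need = math.ceil(theta * max(len(a), len(b)))
    i = j = matched = 0
    while i < len(a) and j < len(b):
        # Upper bound: matches so far plus the shorter remaining tail.
        if matched + min(len(a) - i, len(b) - j) < need:
            return False
        if a[i] == b[j]:
            matched += 1
            if matched >= need:  # lower bound reached: accept early
                return True
            i += 1
            j += 1
        elif key(a[i]) < key(b[j]):
            i += 1
        else:
            j += 1
    return matched >= need


def detect_clones(blocks, theta=0.8):
    key = global_order(blocks)
    index, ordered = build_partial_index(blocks, theta, key)
    clones = set()
    for bid, toks in ordered.items():
        candidates = set()
        for tok in toks[:prefix_len(len(toks), theta)]:
            candidates |= index.get(tok, set())
        for other in candidates - {bid}:
            if verify(toks, ordered[other], theta, key):
                clones.add(tuple(sorted((bid, other))))
    return clones


# Hypothetical toy input: three already-tokenized code blocks.
blocks = {
    "A": "int sum ( int a , int b ) { return a + b ; }".split(),
    "B": "int add ( int a , int b ) { return a + b ; }".split(),
    "C": "void log ( String msg ) { print ( msg ) ; }".split(),
}
print(detect_clones(blocks, theta=0.8))
```

On this toy input only blocks A and B are reported as a clone pair. Block C never becomes a candidate for A or B because none of its indexed prefix tokens appear in their prefixes, so no token comparison is spent on it.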

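EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a token-based clone detector scale to 250 million lines of code (250 MLOC) on a standard workstation (12 GB RAM) while still supporting Type 3 (near-miss) clones?
RQ2: Can filtering heuristics based on token ordering significantly reduce the index size and the number of token comparisons required, without sacrificing recall?
RQ3: Can real-time clone detection be integrated into an IDE efficiently, without degrading performance or disrupting the development workflow?
RQ4: Can a non-distributed, language-independent tool surpass existing batch-mode tools in scalability and in usability during active development?
RQ5: Is it possible to maintain high precision and recall for clone detection across large inter-project repositories on a single machine?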

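Key Findings

Compared with Deckard, CCFinder, NiCad, and iClones, SourcererCC was the only tool that scaled to 250 million lines of code (250 MLOC) on a standard workstation (12 GB RAM).
SourcererCC achieved 86% precision and 86–100% recall on the first three clone types, with accuracy and scalability reported to exceed those of current state-of-the-art tools.
The filtering heuristics significantly reduced the number of code-block comparisons and token comparisons required, substantially improving performance.
SourcererCC-I creates its index at millisecond-level speed, taking only a few milliseconds per 100,000 lines of code.
Incremental index updates keep the tool responsive after code changes, making it practical for continuous development.
SourcererCC-I supports both intra-project and inter-project clone detection and presents the results in the Eclipse IDE as a hierarchical, navigable tree.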
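
The incremental re-indexing and the project/file tree described above can be sketched as follows. This is a toy model under stated assumptions, not the plug-in's implementation or API: the class `IncrementalIndex`, the function `group_clones`, and the `(project, file, method)` block identifiers are all hypothetical, and every token is indexed here for brevity rather than only a filtered subset.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that can re-index one file at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of block ids
        self.file_blocks = {}             # file path -> {block id: tokens}

    def update_file(self, path, blocks):
        """Replace the postings of a single file; other files are untouched."""
        # Drop stale postings that came from the previous version of the file.
        for bid, toks in self.file_blocks.pop(path, {}).items():
            for tok in set(toks):
                self.postings[tok].discard(bid)
                if not self.postings[tok]:
                    del self.postings[tok]
        # Add postings for the new version of the file.
        for bid, toks in blocks.items():
            for tok in set(toks):
                self.postings[tok].add(bid)
        if blocks:
            self.file_blocks[path] = dict(blocks)

    def candidates(self, tokens):
        """Block ids that share at least one token with the query."""
        found = set()
        for tok in set(tokens):
            found |= self.postings.get(tok, set())
        return found


def group_clones(block_ids):
    """Arrange (project, file, method) ids as project -> file -> [methods]."""
    tree = defaultdict(lambda: defaultdict(list))
    for project, path, name in sorted(block_ids):
        tree[project][path].append(name)
    return {project: dict(files) for project, files in tree.items()}


idx = IncrementalIndex()
idx.update_file("projA/Util.java",
                {("projA", "Util.java", "sum"): ["int", "sum", "a", "b"]})
idx.update_file("projB/Math.java",
                {("projB", "Math.java", "add"): ["int", "add", "a", "b"]})

# The developer edits Util.java: only that file's postings are rebuilt.
idx.update_file("projA/Util.java",
                {("projA", "Util.java", "total"): ["long", "total", "x"]})

print(group_clones(idx.candidates(["int", "long"])))
```

Keeping a per-file record of what was indexed is the design choice that makes the update cheap in this sketch: an edit touches only the postings of the edited file instead of rebuilding the whole index.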


This review was generated by AI and reviewed by a human editor.