QUICK REVIEW

[Paper Review] Evaluating AIGC Detectors on Code Content

Jian Wang, Shangqing Liu|arXiv (Cornell University)|Apr 11, 2023

Software Engineering Research | Cited by 16

One-Sentence Summary

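This paper empirically evaluates six AIGC detectors (three open-source, three commercial) on code-related content generated by ChatGPT, compares the results with their performance on natural language content, and complements the evaluation with a newly built dataset and a human study.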

ABSTRACT

Artificial Intelligence Generated Content (AIGC) has garnered considerable attention for its impressive performance, with ChatGPT emerging as a leading AIGC model that produces high-quality responses across various applications, including software development and maintenance. Despite its potential, the misuse of ChatGPT poses significant concerns, especially in education and safety-critical domains. Numerous AIGC detectors have been developed and evaluated on natural language data. However, their performance on code-related content generated by ChatGPT remains unexplored. To fill this gap, in this paper, we present the first empirical study on evaluating existing AIGC detectors in the software domain. We created a comprehensive dataset including 492.5K samples comprising code-related content produced by ChatGPT, encompassing popular software activities like Q&A (115K), code summarization (126K), and code generation (226.5K). We evaluated six AIGC detectors, including three commercial and three open-source solutions, assessing their performance on this dataset. Additionally, we conducted a human study to understand human detection capabilities and compare them with the existing AIGC detectors. Our results indicate that AIGC detectors demonstrate lower performance on code-related data compared to natural language data. Fine-tuning can enhance detector performance, especially for content within the same domain; but generalization remains a challenge. The human evaluation reveals that detection by humans is quite challenging.

Research Motivation and Objectives

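Evaluate the effectiveness of existing AIGC detectors on code-related content generated by ChatGPT.
Compare detector performance on code content versus natural language content.
Investigate whether fine-tuning improves detector performance and generalization.
Assess detector robustness to small mutations of the content.
Compare human ability to detect AI-generated content with that of the detectors.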

Proposed Method

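Build two large-scale datasets, CCD (code-related content) and NLCD (natural language content), each consisting of paired human-written and ChatGPT-generated samples.
Evaluate six detectors on CCD-Test and NLCD-Test (three open-source: GPT2-Detector, DetectGPT, RoBERTa-QA; three commercial: GPTZero, Writer, AITextClassifier).
Fine-tune RoBERTa-QA on domain-specific subsets of NLCD-Train and CCD-Train to measure the improvement.
Test robustness by applying code and text mutations and re-evaluating the detectors.
Conduct an online human study with 50 experienced developers to assess human detection performance.
Use AUC as the primary metric, with FPR and FNR as supplementary metrics.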
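
To make the three metrics concrete, here is a minimal sketch of how AUC, FPR, and FNR could be computed from a detector's output scores. It is not the paper's evaluation code. The labelling convention (1 = ChatGPT-generated, 0 = human-written), the 0.5 decision threshold, and the function names `auc` and `fpr_fnr` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def _split_by_class(labels: Sequence[int], scores: Sequence[float]):
    """Separate scores into AI-generated (label 1) and human-written (label 0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    return pos, neg


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random AI-generated sample receives a
    higher score than a random human-written one (ties count as half)."""
    pos, neg = _split_by_class(labels, scores)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fpr_fnr(labels: Sequence[int], scores: Sequence[float],
            threshold: float = 0.5) -> Tuple[float, float]:
    """FPR: share of human-written samples flagged as AI-generated.
    FNR: share of AI-generated samples that were not flagged.
    The threshold is an assumed cut-off, not a value from the paper."""
    pos, neg = _split_by_class(labels, scores)
    false_pos = sum(1 for s in neg if s >= threshold)
    false_neg = sum(1 for s in pos if s < threshold)
    return false_pos / len(neg), false_neg / len(pos)


# Hypothetical detector scores for three AI-generated and three human samples.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
print(auc(example_labels, example_scores))
print(fpr_fnr(example_labels, example_scores))
```

AUC does not depend on a threshold, which is one plausible reason to treat it as the primary metric, while FPR and FNR describe the two kinds of error at a chosen operating point.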

Experimental Results

Research Questions

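RQ1: How effective are existing detectors at detecting ChatGPT-generated code content and natural language content?
RQ2: Can fine-tuning improve detector performance on code-related data?
RQ3: How robust are the detectors when ChatGPT-generated data is slightly modified?
RQ4: How well do humans distinguish ChatGPT-generated content compared with the detectors?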

Key Findings

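Detectors perform worse on code-related data than on natural language data.
Fine-tuning improves detector performance, but cross-domain generalization remains limited.
Across detectors, AUC, FPR, and FNR show different trade-offs depending on the dataset and language.
Robustness tests show that detector performance degrades under mutation; some detectors outperform others on specific content types.
Humans also find it challenging to detect ChatGPT-generated code content, mirroring the difficulty faced by the detectors.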


This review was generated by AI and reviewed by human editors.