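[Paper Walkthrough] DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection
DPIC decouples prompt-derived features from a text's intrinsic characteristics and detects machine-generated text by using a Siamese network to compare the original text with a GPT-generated re-answered version.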
Large language models (LLMs) have the potential to generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets. Consequently, detecting whether a text is generated by LLMs has become increasingly important. Existing high-quality detection methods usually require access to the interior of the model to extract the intrinsic characteristics. However, since we do not have access to the interior of a black-box model, we must resort to surrogate models, which impacts detection quality. In order to achieve high-quality detection for black-box models, we would like to extract deep intrinsic characteristics of the texts generated by the black-box model. We view the generation process as a coupled process of the prompt and the intrinsic characteristics of the generative model. Based on this insight, we propose a method that decouples prompt and intrinsic characteristics (DPIC) for LLM-generated text detection. Specifically, given a candidate text, DPIC employs an auxiliary LLM to reconstruct the prompt corresponding to the candidate text, then uses the prompt to regenerate text with the auxiliary LLM, which makes the candidate text and the regenerated text align with their prompts, respectively. Then, the similarity between the candidate text and the regenerated text is used as a detection feature, thus eliminating the prompt in the detection process, which allows the detector to focus on the intrinsic characteristics of the generative model. Compared to the baselines, DPIC has achieved an average improvement of 6.76% and 2.91% in detecting texts from different domains generated by GPT4 and Claude3, respectively.
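The three steps in the abstract (reconstruct the prompt, regenerate text, compare) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `auxiliary_llm` is a hypothetical callable standing in for whatever auxiliary LLM is used, the instruction string is invented, and the bag-of-words cosine is only a simple stand-in for the similarity feature, not the feature extractor used in the paper.

```python
import math
from collections import Counter
from typing import Callable


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (stand-in feature)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dpic_feature(candidate: str, auxiliary_llm: Callable[[str], str]) -> float:
    """Return a similarity score between the candidate and its regeneration."""
    # Step 1: ask the auxiliary LLM to reconstruct a prompt for the candidate.
    # The instruction wording below is hypothetical.
    prompt = auxiliary_llm(
        "Write the prompt that could have produced this text:\n" + candidate
    )
    # Step 2: regenerate text from the reconstructed prompt.
    regenerated = auxiliary_llm(prompt)
    # Step 3: the candidate/regenerated similarity is the detection feature.
    return bow_cosine(candidate, regenerated)


def fixed_llm(text: str) -> str:
    """Toy stand-in for an LLM: always returns the same sentence."""
    return "the cat sat on the mat"
```

With the toy `fixed_llm`, a candidate identical to the regenerated sentence scores close to 1 and a candidate sharing no words scores 0. In a real setting the callable would wrap an actual model, and the score would feed a trained classifier rather than being thresholded directly.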
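Research Motivation and Objectives
- Advance robust detection of machine-generated text beyond the training-data domain.
- Introduce the idea of decoupling prompt effects from the intrinsic characteristics of a text.
- Propose a GPT-driven re-answering mechanism that exposes the inheritance of generated text.
- Develop a similarity module based on Siamese embeddings, together with a classifier, for detection.
- Evaluate robustness to perturbations and attacks in order to reflect real-world use.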
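Proposed Method
- Define the genetic inheritance of GPT: the output of a large language model is shaped by both its training data and the prompt.
- Generate a re-answered text by prompting a GPT model to summarize the original text and then answer again.
- Use a Siamese network to compute high-dimensional semantic embeddings and measure their cosine similarity.
- Fuse the embeddings and the similarity in a classifier that predicts whether the text is machine-generated.
- Train on HC3 and evaluate generalization on the Wiki, CCNews, CovidCM, and ACLAbs datasets.
- Compare against PPL-based detectors, DetectGPT, and RoBERTa-based detectors; evaluate robustness to re-translation and polishing attacks.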
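The embedding, similarity, and fusion steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are plain lists of floats that a shared (Siamese) encoder would produce, the fusion is simple concatenation, and the logistic scorer with `weights` and `bias` is a hypothetical stand-in for the trained classifier, whose real architecture is not specified in this summary.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuse_features(emb_original: Sequence[float],
                  emb_reanswer: Sequence[float]) -> List[float]:
    """Concatenate both embeddings with their cosine similarity (assumed fusion)."""
    return [*emb_original, *emb_reanswer,
            cosine_similarity(emb_original, emb_reanswer)]


def machine_probability(features: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Hypothetical logistic scorer standing in for the trained classifier."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Usage with made-up two-dimensional embeddings and made-up weights.
features = fuse_features([1.0, 0.0], [0.0, 1.0])
score = machine_probability(features, [0.0, 0.0, 0.0, 0.0, 2.0], -1.0)
```

Keeping the raw embeddings alongside the similarity lets the classifier use semantic content as well as the closeness of the two texts, which matches the fusion described in the list above.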
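Experimental Results
Research Questions
- RQ1: Can GPT-generated text be detected by exploiting the similarity between the original text and a GPT-generated re-answer to a question generated from that text?
- RQ2: Does using high-dimensional semantic embeddings improve the generalization of cross-domain detection?
- RQ3: How robust is the method to common text perturbations and adaptive attacks?
- RQ4: How does GPT-Pat perform against state-of-the-art detectors across diverse datasets?
Key Findings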
| Dataset | PPL acc | DetectGPT acc | RoBERTa acc | GPT-Pat acc | PPL prec | DetectGPT prec | RoBERTa prec | GPT-Pat prec | PPL F1 | DetectGPT F1 | RoBERTa F1 | GPT-Pat F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC3 | 0.9344 | 0.8140 | 0.9943 | 0.9989 | 0.9519 | 0.8036 | 0.9936 | 0.9984 | 0.9341 | 0.8171 | 0.9944 | 0.9989 |
| Wiki | 0.8547 | 0.7155 | 0.8843 | 0.9532 | 0.8721 | 0.7181 | 0.8152 | 0.9348 | 0.8512 | 0.7138 | 0.8958 | 0.9541 |
| CCNews | 0.7156 | 0.7650 | 0.7011 | 0.9337 | 0.6825 | 0.7477 | 0.6304 | 0.9670 | 0.7393 | 0.7729 | 0.7648 | 0.9313 |
| CovidCM | 0.8353 | 0.7192 | 0.9676 | 0.9676 | 0.8758 | 0.7286 | 0.9634 | 0.9903 | 0.8260 | 0.7133 | 0.9678 | 0.9669 |
| ACLAbs | 0.7050 | 0.8859 | 0.8745 | 0.8983 | 0.9692 | 0.9000 | 1.0000 | 1.0000 | 0.5915 | 0.8839 | 0.8571 | 0.8872 |
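- GPT-Pat reaches an average accuracy of 0.9457 across the four generalization datasets (Wiki, CCNews, CovidCM, ACLAbs), surpassing the RoBERTa-based detector by 12.34% on average.
- GPT-Pat achieves higher precision on several datasets (for example, 0.9670 on CCNews), which reduces false positives.
- The Siamese-plus-embedding classifier, which combines similarity and embedding features, performs best among the architectures tested.
- Adaptive attacks (re-translation and partial polishing) degrade RoBERTa more than GPT-Pat, indicating stronger robustness in practice.
- GPT-Pat maintains state-of-the-art performance on HC3 and generalizes better to out-of-domain data.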
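For readers comparing the accuracy, precision, and F1 figures reported above, the standard definitions can be written out as a short sketch. The confusion counts in the usage example are hypothetical and are not taken from the paper; they only show how a detector can have perfect precision while its F1 stays lower because it misses some machine-generated texts.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1, with 'machine-generated' as positive."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: no false positives, but 20 of 100 machine texts missed.
metrics = detection_metrics(tp=80, fp=0, tn=100, fn=20)
```

Here precision is 1.0 because no human-written text is flagged, while recall is 0.8, so F1 falls below precision.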
This walkthrough was generated by AI and reviewed by a human editor.