QUICK REVIEW

[Paper Review] Can AI-Generated Text be Reliably Detected?

Vinu Sankar Sadasivan, Aounon Kumar|arXiv (Cornell University)|Mar 17, 2023

Adversarial Robustness in Machine Learning|Cited by 148

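One-Sentence Summary

This paper shows that current AI-text detectors (watermark-based, zero-shot, and retrieval-based) are vulnerable to paraphrasing, and that in theory detection becomes less reliable as language models grow more capable. It offers empirical attacks and an impossibility result that clarifies the fundamental limits.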

ABSTRACT

Large Language Models (LLMs) perform impressively well in various applications. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. AI text detectors have shown to be effective under their specific settings. In this paper, we stress-test the robustness of these AI text detectors in the presence of an attacker. We introduce recursive paraphrasing attack to stress test a wide range of detection schemes, including the ones using the watermarking as well as neural network-based detectors, zero shot classifiers, and retrieval-based detectors. Our experiments conducted on passages, each approximately 300 tokens long, reveal the varying sensitivities of these detectors to our attacks. Our findings indicate that while our recursive paraphrasing method can significantly reduce detection rates, it only slightly degrades text quality in many cases, highlighting potential vulnerabilities in current detection systems in the presence of an attacker. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We demonstrate that an attacker can infer hidden AI text signatures without white-box access to the detection method, potentially leading to reputational risks for LLM developers. Finally, we provide a theoretical framework connecting the AUROC of the best possible detector to the Total Variation distance between human and AI text distributions. This analysis offers insights into the fundamental challenges of reliable detection as language models continue to advance. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.

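Motivation and Objectives

Assess the reliability of existing AI-generated text detectors (watermark-based, zero-shot, and retrieval-based).
Demonstrate paraphrasing attacks that reduce detector performance without significantly harming text quality.
Establish theoretical limits on detection in terms of the total variation distance between the human and AI text distributions.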

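Proposed Method

An empirical evaluation of paraphrasing attacks on watermarked and non-watermarked text, using lightweight paraphrasers (PEGASUS-based and T5-based).
Recursive paraphrasing (optionally over multiple rounds) to test the robustness of different detectors (soft watermarking, zero-shot, neural network-based, and retrieval-based detectors).
An evaluation of retrieval-based defenses against recursive paraphrasing and paraphrase-based spoofing attacks.
Derivation of an impossibility bound that relates AUROC to the total variation distance between the human and AI text distributions.
Extension of the impossibility result to pseudorandomness versus true randomness in text generation.
A spoofing analysis in which an adversary infers hidden signatures to undermine the credibility of detectors.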
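
The recursive paraphrasing loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_paraphrase` and `toy_detector` are hypothetical stand-ins (a one-word synonym swap and a marker-word counter) for a real PEGASUS- or T5-based paraphraser and a real detector, and the example sentence is made up.

```python
from typing import Callable, List

def recursive_paraphrase(text: str,
                         paraphrase: Callable[[str], str],
                         rounds: int) -> List[str]:
    """Return [original, round 1, ..., round `rounds`], where each
    round paraphrases the output of the previous round."""
    versions = [text]
    for _ in range(rounds):
        versions.append(paraphrase(versions[-1]))
    return versions

# Hypothetical stand-ins, for illustration only.
SYNONYMS = {"utilize": "use", "numerous": "many", "demonstrate": "show"}

def toy_paraphrase(text: str) -> str:
    """Replace the first word that has an entry in SYNONYMS (one edit per round)."""
    words = text.split()
    for i, word in enumerate(words):
        if word in SYNONYMS:
            words[i] = SYNONYMS[word]
            break
    return " ".join(words)

def toy_detector(text: str, threshold: int = 2) -> bool:
    """Flag text as AI-generated if it contains at least `threshold` marker words."""
    return sum(word in SYNONYMS for word in text.split()) >= threshold

original = "we utilize numerous tests to demonstrate robustness"
versions = recursive_paraphrase(original, toy_paraphrase, rounds=3)
flags = [toy_detector(v) for v in versions]
```

In this toy setting the detector's signal erodes a little with each round, which is the qualitative effect the attack relies on; a real evaluation would also track text quality after every round.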

Figure 1 : An illustration of vulnerabilities of existing AI-text detectors. We consider both watermarking-based [ 1 ] and non-watermarking-based detectors [ 2 , 3 , 4 ] and show that they are not reliable in practical scenarios. Colored arrow paths show the potential pipelines for adversaries to av

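Experimental Results

Research Questions

RQ1: Can current detectors reliably distinguish AI-generated text under practical paraphrasing or spoofing attacks?
RQ2: How do paraphrasing and recursive paraphrasing affect the accuracy of watermarking, zero-shot, neural network-based, and retrieval-based detectors?
RQ3: What are the fundamental limits of detecting AI-generated text as large language models become more capable?
RQ4: How does pseudorandomness in generation affect detectability and detector performance?
RQ5: Can spoofing attacks undermine the credibility of watermarks and detectors, and under what conditions?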

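Key Findings

Paraphrasing attacks substantially degrade watermark, zero-shot, and neural network-based detectors (for example, watermark accuracy drops from 97% to 80%, and zero-shot AUROC drops from 96.5% to 25.2%).
Recursive paraphrasing reduces the accuracy of the retrieval-based detector from 100% to 25% at a 1% false positive rate; watermark and zero-shot detectors also degrade severely.
An impossibility result shows that AUROC(D) ≤ 1/2 + TV(M, H) − TV(M, H)^2/2, where D is a detector and TV(M, H) is the total variation distance between the AI text distribution M and the human text distribution H. As the two distributions converge, any detector approaches random guessing; in the pseudorandom case, the additional term ε is negligible.
Empirical estimates suggest that the total variation between human text and the outputs of GPT-3 models decreases as model size grows, which supports the theoretical limit.
In spoofing attacks, an adversary who learns the watermark signature or exploits semantic retrieval can cause human-written text to be detected as AI-generated, eroding trust in detectors.
The results call for caution, and for rigorous independent evaluation, before detectors are deployed in the real world.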
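
To get a feel for the bound AUROC(D) ≤ 1/2 + TV(M, H) − TV(M, H)^2/2, the sketch below evaluates it for a few total variation values. The function names and the two toy distributions are assumptions made for illustration; they are not data from the paper.

```python
from typing import Dict

def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions,
    given as dicts mapping outcomes to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def auroc_upper_bound(tv: float) -> float:
    """Upper bound on the AUROC of any detector: 1/2 + TV - TV^2 / 2."""
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv ** 2 / 2

# Toy distributions over two outcomes (made up for illustration).
human_dist = {"a": 0.5, "b": 0.5}
model_dist = {"a": 0.6, "b": 0.4}

tv = total_variation(model_dist, human_dist)   # about 0.1
bound = auroc_upper_bound(tv)                  # about 0.595

for value in (0.0, 0.1, 0.5, 1.0):
    print(f"TV = {value:.1f} -> AUROC <= {auroc_upper_bound(value):.3f}")
```

At TV = 0 the bound equals 0.5, the AUROC of random guessing, and at TV = 1 it equals 1, so the bound only constrains detectors when the two distributions overlap.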

Figure 2 : Accuracy of the soft watermarking detector on paraphrased LLM outputs plotted against perplexity. The lower the perplexity is, the better the quality of the text is.
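
For readers unfamiliar with the quality metric in Figure 2, perplexity is commonly computed as the exponential of the average negative log-probability that a scoring language model assigns to each token. The sketch below uses that standard definition; the function name and the token probabilities are hypothetical, and the summary does not specify which scoring model the paper uses.

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum(log p_i))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities for two passages.
fluent = [math.log(0.5)] * 4       # each token given probability 0.5
awkward = [math.log(0.125)] * 4    # each token given probability 0.125

fluent_ppl = perplexity(fluent)    # about 2
awkward_ppl = perplexity(awkward)  # about 8
```

A passage whose tokens the scoring model finds more predictable gets a lower perplexity, which is why the caption treats lower perplexity as better text quality.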

This review was generated by AI and reviewed by human editors.