QUICK REVIEW

[Paper Review] RADAR: Robust AI-Text Detection via Adversarial Learning

Xiaomeng Hu, Pin‐Yu Chen|arXiv (Cornell University)|Jul 7, 2023

Hate Speech and Cyberbullying Detection | Cited by 29

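One-Sentence Summary

RADAR jointly trains a paraphraser and a detector in an adversarial framework so that AI-generated text remains detectable after paraphrasing, across multiple LLMs and datasets.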

ABSTRACT

Recent advances in large language models (LLMs) and the intensifying popularity of ChatGPT-like applications have blurred the boundary of high-quality text generation between humans and machines. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. While existing works show that current AI-text detectors are not robust to LLM-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a robust AI-text detector via adversarial learning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.

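Motivation and Objectives

Improve the robustness of AI-text detection as paraphrased machine-generated text becomes more common.
Propose RADAR, which jointly trains a paraphraser and a detector via adversarial learning.
Demonstrate RADAR's robustness, transferability, and performance across different LLMs and datasets.
Examine how adversarial training affects the detector's transferability and the paraphraser's quality.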

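Proposed Method

Use a frozen target LLM to generate AI-text from a corpus of human-written text.
Train the paraphraser G_sigma to rewrite AI-text so that it evades detection, using PPO with an entropy penalty.
Train the detector D_phi to distinguish human-text from AI-text (including the paraphraser's outputs), using a reweighted logistic loss to handle sample imbalance.
Alternately update the paraphraser (via the PPO reward) and the detector (via the logistic loss) until AUROC on the validation set stabilizes.
Evaluate detector performance on 4 datasets and 8 LLMs, including against an unseen paraphraser (GPT-3.5-Turbo).
Optionally tune a balancing hyperparameter lambda to trade off performance on unparaphrased versus paraphrased text.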
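
The detector objective described above can be illustrated with a minimal sketch. It assumes that D_phi outputs the probability that a text is human-written, that the loss is a human-text term plus lambda times the two AI-text terms (original and paraphrased), and that the paraphraser's reward is simply the detector's "human" probability for its output. The function names, the default `lam=0.5`, and the exact weighting are illustrative assumptions, not details confirmed by this summary.

```python
import math


def _mean_nll(probs_human, label_is_human, eps=1e-12):
    """Mean negative log-likelihood of the correct label.

    probs_human: detector outputs, each the predicted probability of "human".
    """
    if not probs_human:
        raise ValueError("need at least one sample")
    total = 0.0
    for p in probs_human:
        q = p if label_is_human else 1.0 - p
        total += -math.log(max(q, eps))  # clamp to avoid log(0)
    return total / len(probs_human)


def detector_loss(p_human_text, p_ai_text, p_paraphrased, lam=0.5):
    """Reweighted logistic loss (assumed form): L_H + lam * (L_M + L_P).

    lam down-weights the AI-text side, which contains two sample sets
    (original and paraphrased) for every human-text set.
    """
    loss_h = _mean_nll(p_human_text, label_is_human=True)
    loss_m = _mean_nll(p_ai_text, label_is_human=False)
    loss_p = _mean_nll(p_paraphrased, label_is_human=False)
    return loss_h + lam * (loss_m + loss_p)


def paraphraser_reward(p_human_for_paraphrase):
    """Assumed PPO reward: higher when the detector believes the paraphrase is human."""
    return p_human_for_paraphrase
```

With every prediction at 0.5 and `lam=0.5`, each term equals log 2, so the loss is 2 log 2. A perfect detector drives all three terms to zero, which at the same time drives the paraphraser's reward to zero and pushes it to search for harder paraphrases.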

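Experimental Results

Research Questions

RQ1: Does an adversarially trained paraphraser cause the detectability of AI-text to collapse, and can a detector be trained to withstand such paraphrasing?
RQ2: How does RADAR perform across diverse LLMs and datasets, and how well does the detector transfer to unseen models?
RQ3: Does adversarial training improve robustness to paraphrasing without significantly sacrificing detection performance on unparaphrased text?
RQ4: How does instruction tuning affect the detector's transferability across LLMs?
RQ5: Does the learned detector generalize to paraphrasers not seen during training?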

Key Findings

Evaluation setting	Xsum	SQuAD	WP	TOEFL	Average
w/o Paraphraser - log p	0.882	0.868	0.967	0.832	0.887
w/o Paraphraser - rank	0.722	0.752	0.814	0.731	0.755
w/o Paraphraser - log rank	0.902	0.893	0.975	0.847	0.904
w/o Paraphraser - entropy	0.536	0.521	0.296	0.534	0.472
w/o Paraphraser - DetectGPT	0.874	0.790	0.883	0.919	0.867
w/o Paraphraser - OpenAI (RoBERTa)	0.953	0.914	0.924	0.810	0.900
w/o Paraphraser - RADAR	0.934	0.825	0.847	0.820	0.856
RADAR-Seen Paraphraser - log p	0.230	0.156	0.275	0.130	0.198
RADAR-Seen Paraphraser - rank	0.334	0.282	0.357	0.163	0.284
RADAR-Seen Paraphraser - log rank	0.245	0.175	0.281	0.134	0.209
RADAR-Seen Paraphraser - entropy	0.796	0.845	0.763	0.876	0.820
RADAR-Seen Paraphraser - DetectGPT	0.191	0.105	0.117	0.177	0.159
RADAR-Seen Paraphraser - OpenAI (RoBERTa)	0.821	0.842	0.892	0.670	0.806
RADAR-Seen Paraphraser - RADAR	0.920	0.927	0.908	0.932	0.922
RADAR-Unseen Paraphraser - log p	0.266	0.343	0.641	0.438	0.422
RADAR-Unseen Paraphraser - rank	0.433	0.436	0.632	0.342	0.461
RADAR-Unseen Paraphraser - log rank	0.282	0.371	0.632	0.421	0.426
RADAR-Unseen Paraphraser - entropy	0.779	0.710	0.499	0.618	0.651
RADAR-Unseen Paraphraser - DetectGPT	0.360	0.384	0.609	0.630	0.434
RADAR-Unseen Paraphraser - OpenAI (RoBERTa)	0.789	0.629	0.726	0.364	0.627
RADAR-Unseen Paraphraser - RADAR	0.955	0.861	0.851	0.763	0.857
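
To help read the table, here is a small sketch of how an AUROC score can be computed, assuming the table reports AUROC (the metric named in the method section) and that a higher detector score means "more likely AI-text". The function name `auroc` and the pairwise formulation are illustrative choices, not the authors' evaluation code.

```python
def auroc(scores_ai, scores_human):
    """Probability that a random AI-text outranks a random human-text.

    Ties count as half a win. 1.0 is perfect separation, 0.5 is chance,
    and values below 0.5 mean the detector ranks AI-text as *more*
    human-like than real human text.
    """
    if not scores_ai or not scores_human:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in scores_ai:
        for h in scores_human:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(scores_ai) * len(scores_human))
```

Under this reading, the baselines that fall well below 0.5 once the RADAR-seen paraphraser is applied (for example DetectGPT at 0.159 on average) are not merely guessing: they systematically score paraphrased AI-text as more human-like than human text.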

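RADAR achieves robust AI-text detection across 8 LLMs and 4 datasets, outperforming the baselines when paraphrasing is applied.
The detector stays competitive on unperturbed AI-text while gaining robustness to paraphrasing.
Detectors trained with instruction-tuned LLMs transfer better to other LLMs, which suggests potential as a general-purpose detector.
The RADAR detector transfers strongly to GPT-4-generated text in several cases.
As a by-product, RADAR improves the paraphraser: its paraphrases are rated higher in quality by human-like evaluation and by iBLEU score.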
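
For context on the iBLEU score mentioned above, the metric is commonly defined as follows, where c is the candidate paraphrase, r a reference paraphrase, s the source sentence, and alpha a trade-off weight. The value of alpha used in the paper is not stated in this summary, so it is left symbolic here.

```latex
\mathrm{iBLEU}(s, r, c) \;=\; \alpha \cdot \mathrm{BLEU}(c, r) \;-\; (1 - \alpha) \cdot \mathrm{BLEU}(c, s)
```

The first term rewards fidelity to the reference, and the second penalizes paraphrases that merely copy the source, so a higher iBLEU indicates a paraphrase that is both adequate and genuinely different from its input.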

This review was generated by AI and reviewed by a human editor.