QUICK REVIEW

[Paper Review] Raidar: geneRative AI Detection viA Rewriting

Chengzhi Mao, Carl Vondrick | arXiv (Cornell University) | Jan 23, 2024

Topic Modeling | 5 citations

TL;DR

Raidar detects AI-generated text by prompting LLMs to rewrite the input and measuring how much the text changes, using invariance, equivariance, and uncertainty signals to improve detection across domains and models.

ABSTRACT

We find that large language models (LLMs) are more likely to modify human-written text than AI-generated text when tasked with rewriting. This tendency arises because LLMs often perceive AI-generated text as high-quality, leading to fewer modifications. We introduce a method to detect AI-generated content by prompting LLMs to rewrite text and calculating the editing distance of the output. We dubbed our geneRative AI Detection viA Rewriting method Raidar. Raidar significantly improves the F1 detection scores of existing AI content detection models -- both academic and commercial -- across various domains, including News, creative writing, student essays, code, Yelp reviews, and arXiv papers, with gains of up to 29 points. Operating solely on word symbols without high-dimensional features, our method is compatible with black box LLMs, and is inherently robust on new content. Our results illustrate the unique imprint of machine-generated text through the lens of the machines themselves.

Motivation & Objective

Address the need for robust detection of machine-generated text as LLM capabilities advance.
Introduce a rewriting-based detection paradigm that does not rely on high-dimensional features.
Leverage symbolic (word-level) outputs and editing-distance metrics to distinguish human vs. machine text.
Demonstrate cross-domain and cross-model robustness, including black-box LLMs and unseen generators.

Proposed method

Prompt LLMs with rewriting prompts to obtain a rewritten version of the input text.
Compute invariance, equivariance, and output-uncertainty metrics from the original and rewritten text.
Operate on discrete word-symbol outputs to avoid reliance on continuous feature spaces.
Measure editing distance between original and rewritten text using Levenshtein-based ratio and bag-of-words edits.
Train a binary classifier (logistic regression or XGBoost) on the rewriting-based features.
Show robustness against adversarial prompts by training on multiple prompts.
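
The feature-extraction steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `word_levenshtein`, `edit_ratio`, `bag_of_words_edits`, and `rewriting_features` are hypothetical, the ratio is one plausible word-level normalization of Levenshtein distance, and `toy_rewrite` is a stand-in for a real LLM call (which would send a rewriting prompt plus the input text to a model).

```python
from collections import Counter
from typing import Callable, List


def word_levenshtein(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_ratio(original: str, rewritten: str) -> float:
    """Similarity in [0, 1]; 1.0 means the rewrite left the text unchanged."""
    a, b = original.split(), rewritten.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_levenshtein(a, b) / longest


def bag_of_words_edits(original: str, rewritten: str) -> int:
    """Number of word occurrences added or removed, ignoring word order."""
    ca, cb = Counter(original.split()), Counter(rewritten.split())
    return sum(((ca - cb) + (cb - ca)).values())


def rewriting_features(text: str, rewrite: Callable[[str], str]) -> List[float]:
    """Feature vector for a downstream binary classifier (assumed layout)."""
    rewritten = rewrite(text)
    return [edit_ratio(text, rewritten),
            float(bag_of_words_edits(text, rewritten))]


def toy_rewrite(text: str) -> str:
    """Hypothetical stand-in for an LLM rewriting call."""
    return text.replace("really good", "excellent")


if __name__ == "__main__":
    print(rewriting_features("the food was really good", toy_rewrite))
```

Under the paper's hypothesis, human-written inputs would tend to produce lower similarity ratios and more bag-of-words edits than AI-generated inputs, and a classifier such as logistic regression or XGBoost would be trained on vectors like the one returned here.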

Experimental results

Research questions

RQ1: Can rewriting-based signals (invariance/equivariance/uncertainty) reliably distinguish machine- from human-generated text across domains?
RQ2: Do these signals generalize across different language models and rewriting prompts, including black-box LLMs?
RQ3: How does input length affect detection performance, and can the method withstand adversarial attempts to bypass detectors?
RQ4: What is the impact of different rewriting models (Ada, Text-Davinci-002, GPT-3.5-turbo) on detection efficacy?
RQ5: Is the approach robust in out-of-distribution scenarios where the test model differs from the training models?

Key findings

Raidar substantially improves detection performance over state-of-the-art baselines, with gains of up to 29 F1 points.
The method remains effective when detecting text from unseen or different generation models (OOD settings) with notable improvements (up to 32 points).
Using a single rewriting prompt with GPT-3.5-turbo yields strong detection performance; larger rewriting models further boost results.
Detection remains robust across domains (news, creative writing, student essays, code, Yelp, arXiv abstracts) and even when prompts are tailored to evade detection.
Longer inputs generally improve detection performance, and the approach achieves reasonable F1 scores even for short inputs (as low as ten words).
Training with multiple prompts enhances robustness against adversarial rephrasing attempts.
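
Since the findings above are reported in F1 points, a short sketch of the standard F1 definition may help in reading them. This is the textbook formula, not the paper's evaluation script; the function names `f1_score` and `gain_in_points` and the example labels are illustrative assumptions, with label 1 assumed to mean "machine-generated".

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def gain_in_points(f1_baseline: float, f1_new: float) -> float:
    """Difference between two F1 scores expressed on a 0-100 scale."""
    return 100.0 * (f1_new - f1_baseline)


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0]  # made-up labels for illustration only
    y_pred = [1, 1, 0, 1, 0]
    print(f1_score(y_true, y_pred))
```

Read this way, a gain of 29 points corresponds to a difference of 0.29 between two F1 scores on the 0 to 1 scale.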

This review was created by AI and reviewed by human editors.