[Paper Review] Raidar: geneRative AI Detection viA Rewriting
Raidar detects AI-generated text by prompting LLMs to rewrite the input and measuring how much the text changes, using invariance, equivariance, and uncertainty signals to improve detection across domains and models.
We find that large language models (LLMs) are more likely to modify human-written text than AI-generated text when tasked with rewriting. This tendency arises because LLMs often perceive AI-generated text as high-quality, leading to fewer modifications. We introduce a method to detect AI-generated content by prompting LLMs to rewrite text and calculating the editing distance of the output. We dubbed our geneRative AI Detection viA Rewriting method Raidar. Raidar significantly improves the F1 detection scores of existing AI content detection models -- both academic and commercial -- across various domains, including News, creative writing, student essays, code, Yelp reviews, and arXiv papers, with gains of up to 29 points. Operating solely on word symbols without high-dimensional features, our method is compatible with black box LLMs, and is inherently robust on new content. Our results illustrate the unique imprint of machine-generated text through the lens of the machines themselves.
Motivation & Objective
- Motivate robust detection of machine-generated text amid advancing LLM capabilities.
- Introduce a rewriting-based detection paradigm that does not rely on high-dimensional features.
- Leverage symbolic (word-level) outputs and editing-distance metrics to distinguish human vs. machine text.
- Demonstrate cross-domain and cross-model robustness, including black-box LLMs and unseen generators.
Proposed method
- Prompt LLMs with rewriting prompts to obtain a rewritten version of the input text.
- Compute invariance, equivariance, and output-uncertainty metrics from the original and rewritten text.
- Operate on discrete word-symbol outputs to avoid reliance on continuous feature spaces.
- Measure editing distance between original and rewritten text using Levenshtein-based ratio and bag-of-words edits.
- Train a binary classifier (logistic regression or XGBoost) on the rewriting-based features.
- Show robustness against adversarial prompts by training on multiple prompts.
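The feature-extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `rewrite` callable is a hypothetical stand-in for an LLM call with a rewriting prompt. The normalization in `similarity_ratio` is one common choice and may differ from the ratio used in the paper. The uncertainty feature is approximated here as the average disagreement between repeated rewrites, which is an assumption. Equivariance features and the downstream classifier are omitted.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, List


def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance using two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Return 1.0 for identical strings and 0.0 when every character differs.

    Assumption: distance is normalized by the longer string's length.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bag_of_words_edits(a: str, b: str) -> int:
    """Count word insertions plus deletions, ignoring word order and case."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum(((ca - cb) + (cb - ca)).values())


def rewrite_features(
    text: str, rewrite: Callable[[str], str], n_samples: int = 3
) -> List[float]:
    """Build [mean similarity, mean bag-of-words edits, uncertainty] for one input.

    `rewrite` is a hypothetical wrapper around an LLM rewriting prompt.
    """
    rewrites = [rewrite(text) for _ in range(n_samples)]
    # Invariance-style signal: how close each rewrite stays to the input.
    mean_ratio = sum(similarity_ratio(text, r) for r in rewrites) / n_samples
    mean_bow = sum(bag_of_words_edits(text, r) for r in rewrites) / n_samples
    # Uncertainty-style signal: how much the rewrites disagree with each other.
    pairs = list(combinations(rewrites, 2))
    uncertainty = (
        sum(1.0 - similarity_ratio(x, y) for x, y in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    return [mean_ratio, mean_bow, uncertainty]
```

Under the review's premise, machine-generated inputs would tend to show a higher similarity ratio and fewer word edits than human-written ones. The resulting feature vectors would then be passed to a binary classifier such as logistic regression or XGBoost.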
Experimental results
Research questions
- RQ1: Can rewriting-based signals (invariance/equivariance/uncertainty) reliably distinguish machine- from human-generated text across domains?
- RQ2: Do these signals generalize across different language models and rewriting prompts, including black-box LLMs?
- RQ3: How does input length affect detection performance, and can the method withstand adversarial attempts to bypass detectors?
- RQ4: What is the impact of different rewriting models (Ada, Text-Davinci-002, GPT-3.5-turbo) on detection efficacy?
- RQ5: Is the approach robust in out-of-distribution scenarios where the test model differs from the training models?
Key findings
- Raidar substantially improves detection performance over state-of-the-art baselines, with gains up to 29 F1 points on several datasets.
- The method remains effective when detecting text from unseen or different generation models (OOD settings) with notable improvements (up to 32 points).
- Using a single rewriting prompt with GPT-3.5-turbo yields strong detection performance; larger rewriting models further boost results.
- Detection remains robust across domains (news, creative writing, student essays, code, Yelp, arXiv abstracts) and even when prompts are tailored to evade detection.
- Longer inputs generally improve detection performance, but the approach still achieves reasonable F1 scores on short inputs (as short as ten words).
- Training with multiple prompts enhances robustness against adversarial rephrasing attempts.
This review was created by AI and reviewed by human editors.