QUICK REVIEW

[Paper Review] Few-Shot Detection of Machine-Generated Text using Style Representations

Rafael Rivera Soto, Kailin Koch|arXiv (Cornell University)|Jan 12, 2024

Natural Language Processing Techniques | Cited by 6

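One-Sentence Summary

This paper proposes a few-shot method for detecting machine-generated text. It relies on style representations trained on human-written data, detects text from previously unseen large language models, and can identify the generating model from only a handful of examples.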

ABSTRACT

The advent of instruction-tuned language models that convincingly mimic human writing poses a significant risk of abuse. However, such abuse may be counteracted with the ability to detect whether a piece of text was composed by a language model rather than a human author. Some previous approaches to this problem have relied on supervised methods by training on corpora of confirmed human- and machine- written documents. Unfortunately, model under-specification poses an unavoidable challenge for neural network-based detectors, making them brittle in the face of data shifts, such as the release of newer language models producing still more fluent text than the models used to train the detectors. Other approaches require access to the models that may have generated a document in question, which is often impractical. In light of these challenges, we pursue a fundamentally different approach not relying on samples from language models of concern at training time. Instead, we propose to leverage representations of writing style estimated from human-authored text. Indeed, we find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors, including state-of-the-art large language models like Llama-2, ChatGPT, and GPT-4. Furthermore, given a handful of examples composed by each of several specific language models of interest, our approach affords the ability to predict which model generated a given document. The code and data to reproduce our experiments are available at https://github.com/LLNL/LUAR/tree/main/fewshot_iclr2024.

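Motivation and Objectives

Advance robust detection of machine-generated text under distribution shift and in the presence of unseen models.
Propose style representations that capture writing style independently of topic or domain.
Develop a few-shot detection framework that uses a small number of samples from a target LLM to detect and attribute generated text.
Evaluate in single-target and multi-target settings, comparing against zero-shot and other few-shot baselines.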

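Proposed Method

Define a style representation f that maps a small number of documents to a fixed-dimensional vector.
Train f with a contrastive objective on a large corpus of human-written text to capture time-invariant writing style.
Score new samples against a target model using the cosine similarity of aggregated style representations.
Apply this to few-shot detection by measuring the similarity between the target model's support set and the query samples.
Experiment with variants of the style model (UAR, CISR) and with combinations of domains and LLMs.
Compare against zero-shot detectors and other few-shot baselines (ProtoNet, MAML, SBERT).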
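
The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_style_embedding` is a hypothetical stand-in for f that hashes character trigrams (the paper uses learned representations such as UAR), and averaging per-document vectors is an assumed way of aggregating. The names `aggregate`, `detection_score`, and `attribute` are likewise invented for this example.

```python
import math
from typing import Dict, List

DIM = 64  # size of the toy embedding; an arbitrary choice for this sketch


def toy_style_embedding(text: str) -> List[float]:
    """Hypothetical stand-in for f: count character trigrams in DIM hash buckets."""
    vec = [0.0] * DIM
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        trigram = lowered[i:i + 3]
        # Deterministic bucket index (the built-in hash() is salted per process).
        bucket = sum(ord(c) * (31 ** k) for k, c in enumerate(trigram)) % DIM
        vec[bucket] += 1.0
    return vec


def aggregate(docs: List[str]) -> List[float]:
    """Average per-document embeddings into one vector (assumed aggregation)."""
    vectors = [toy_style_embedding(d) for d in docs]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def detection_score(support_docs: List[str], query_docs: List[str]) -> float:
    """Similarity between a target model's support set and the query samples."""
    return cosine(aggregate(support_docs), aggregate(query_docs))


def attribute(supports: Dict[str, List[str]], query_docs: List[str]) -> str:
    """Return the name of the support set most similar to the query."""
    return max(supports, key=lambda name: detection_score(supports[name], query_docs))
```

A higher `detection_score` means the query reads more like the target model's support set. Thresholding that score gives a detector, and taking the most similar support set, as `attribute` does, gives a simple form of model attribution.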

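Experimental Results

Research Questions

RQ1: Can style representations learned from human writing distinguish human from machine authors across unseen LLMs?
RQ2: In the few-shot setting, how many examples are needed to reliably detect machine-generated text?
RQ3: Do style representations trained on multiple domains and multiple LLMs improve detection and model attribution?
RQ4: How robust are style-based detectors to paraphrasing attacks and to multiple target LLMs?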

Key Findings

Method	Training	pAUC	Dataset	N=5	N=10
UAR	Reddit (5M)	0.905 (0.001)	-	0.905	0.981
UAR	Reddit (5M), Twitter, StackExchange	-	-	0.886 (0.001)	0.968 (0.001)
UAR	AAC, Reddit (politics)	-	-	0.877 (0.001)	0.940 (0.0013)
CISR	Reddit (hard neg/hard pos)	-	-	0.839 (0.001)	0.933 (0.0013)
RoBERTa (ProtoNet)	AAC, Reddit (politics)	-	-	0.871 (0.001)	0.9475 (0.0014)
RoBERTa (MAML)	AAC, Reddit (politics)	-	-	0.662 (0.006)	0.685 (0.0068)
SBERT	Multiple	-	-	0.621 (0.002)	0.716 (0.0022)
AI Detector (fine-tuned)	AAC, Reddit (politics)	-	-	0.6510 (0.031)	0.659 (0.032)
AI Detector	WebText, GPT2-XL	-	-	0.603 (0.025)	0.601 (0.0249)
Rank (GPT2-XL)	BookCorpus, WebText	-	-	0.569 (0.015)	0.558 (0.017)
LogRank (GPT2-XL)	BookCorpus, WebText	-	-	0.764 (0.036)	0.775 (0.038)
Entropy (GPT2-XL)	BookCorpus, WebText	-	-	0.4984 (0.0005)	0.4977 (0.0002)

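Style representations reliably detect text from unseen LLMs given only a few examples.
In terms of pAUC in the low-FPR region, UAR style representations extended with multi-domain data such as Reddit outperform baselines including ProtoNet, CISR, SBERT, and zero-shot detectors.
Including additional LLM-generated data (AAC) and training on multiple LLMs improves robustness to paraphrasing attacks.
Metric-based detectors such as ProtoNet may perform worse in the multi-LLM detection setting, whereas style-based methods maintain strong performance.
In favorable settings, the best single-target and multi-target results reach pAUC values of about 0.90 or higher, and style representations trained on multiple LLMs remain robust under paraphrase augmentation.
The authors release their dataset and report reproducibility details, indicating that practical deployment is feasible.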
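
For readers unfamiliar with the metric, the sketch below shows one common way to compute a partial AUC over the low-FPR region. It makes two assumptions the summary does not confirm: a cutoff of `max_fpr=0.01`, and the McClish standardization (the same correction scikit-learn's `roc_auc_score` applies when `max_fpr` is set), under which a chance-level detector scores 0.5 and a perfect one scores 1.0. The function name `partial_auc` is made up for this example.

```python
from typing import List, Tuple


def partial_auc(human_scores: List[float],
                machine_scores: List[float],
                max_fpr: float = 0.01) -> float:
    """Standardized partial AUC over FPR in [0, max_fpr]; machine text is the positive class."""
    labeled = [(s, 1) for s in machine_scores] + [(s, 0) for s in human_scores]
    labeled.sort(key=lambda pair: -pair[0])  # highest score first
    n_pos, n_neg = len(machine_scores), len(human_scores)

    # Build ROC points, treating tied scores as a single threshold step.
    points: List[Tuple[float, float]] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(labeled):
        j = i
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            tp += labeled[j][1]
            fp += 1 - labeled[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Trapezoidal area up to max_fpr, interpolating the segment that crosses it.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0

    # McClish standardization: chance maps to 0.5, a perfect detector to 1.0.
    min_area = 0.5 * max_fpr * max_fpr
    max_area = max_fpr
    return 0.5 * (1.0 + (area - min_area) / (max_area - min_area))
```

Under this normalization, a detector with no true positives in the low-FPR region scores slightly below 0.5, because the chance diagonal still contributes a small area there.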


This review was generated by AI and reviewed by a human editor.