QUICK REVIEW

[Paper Review] Few-Shot Detection of Machine-Generated Text using Style Representations

Rafael Rivera Soto, Kailin Koch|arXiv (Cornell University)|Jan 12, 2024

Natural Language Processing Techniques|Citations: 6

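One-Sentence Summary

This paper introduces a few-shot method for detecting machine-generated text using style representations trained on human-authored data. The method not only detects text from unseen LLMs but can also identify which model generated a document from only a handful of examples.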

ABSTRACT

The advent of instruction-tuned language models that convincingly mimic human writing poses a significant risk of abuse. However, such abuse may be counteracted with the ability to detect whether a piece of text was composed by a language model rather than a human author. Some previous approaches to this problem have relied on supervised methods by training on corpora of confirmed human- and machine-written documents. Unfortunately, model under-specification poses an unavoidable challenge for neural network-based detectors, making them brittle in the face of data shifts, such as the release of newer language models producing still more fluent text than the models used to train the detectors. Other approaches require access to the models that may have generated a document in question, which is often impractical. In light of these challenges, we pursue a fundamentally different approach not relying on samples from language models of concern at training time. Instead, we propose to leverage representations of writing style estimated from human-authored text. Indeed, we find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors, including state-of-the-art large language models like Llama-2, ChatGPT, and GPT-4. Furthermore, given a handful of examples composed by each of several specific language models of interest, our approach affords the ability to predict which model generated a given document. The code and data to reproduce our experiments are available at https://github.com/LLNL/LUAR/tree/main/fewshot_iclr2024.

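Motivation and Objectives

Make machine-generated text detection robust to distribution shift and to previously unseen models.
Propose style-based representations that capture writing style independently of topic and domain.
Develop a few-shot framework that performs detection and attribution using a small number of samples from the target LLM.
Evaluate in single-target and multi-target settings, comparing against zero-shot and other few-shot baselines.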

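Proposed Method

Define a style representation f that maps a small set of documents to a fixed-dimensional vector.
Train f with contrastive learning on a corpus of human-authored text so that it captures time-invariant writing style.
Score a new sample by the cosine similarity between its style representation and the aggregated style representation of the target model's samples.
For few-shot detection, evaluate the similarity between the target model's support set and the query sample.
Try variants of the style model (UAR, CISR) and different combinations of training domains and LLMs.
Compare against zero-shot detectors and other few-shot baselines (ProtoNet, MAML, SBERT).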
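
The scoring steps above can be illustrated with a minimal sketch. It is not the authors' implementation: toy_style_embedding is a hypothetical stand-in for the trained style representation f (it only hashes character trigrams), and the names DIM, detection_score, attribute, and the example texts are assumptions made for illustration. What carries over is the structure: embed the support set and the query as whole sets, compare them with cosine similarity, and for attribution pick the candidate model with the highest similarity.

```python
import math
import zlib

DIM = 256  # hypothetical embedding size for this toy example


def toy_style_embedding(documents):
    """Hypothetical stand-in for the style representation f.

    Maps a small SET of documents to one fixed-size vector by hashing
    character trigrams into DIM buckets. A real system would use a
    trained style encoder here instead.
    """
    vec = [0.0] * DIM
    for doc in documents:
        text = doc.lower()
        for i in range(len(text) - 2):
            bucket = zlib.crc32(text[i:i + 3].encode("utf-8")) % DIM
            vec[bucket] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detection_score(support_docs, query_docs, embed=toy_style_embedding):
    """Higher score means the query looks more like the support set's style."""
    return cosine(embed(support_docs), embed(query_docs))


def attribute(supports_by_model, query_docs, embed=toy_style_embedding):
    """Return the candidate model whose support set best matches the query."""
    q = embed(query_docs)
    return max(
        supports_by_model,
        key=lambda name: cosine(embed(supports_by_model[name]), q),
    )


# Made-up support sets: a handful of texts per candidate model.
supports = {
    "model_a": [
        "Certainly! Here is a summary of the key points.",
        "Certainly! Here are three options to consider.",
    ],
    "model_b": [
        "lol idk tbh, seems kinda sus 2 me",
        "nah tbh idk why ppl do that lol",
    ],
}
query = ["Certainly! Here is an overview of the main steps."]

print("score vs model_a:", round(detection_score(supports["model_a"], query), 3))
print("score vs model_b:", round(detection_score(supports["model_b"], query), 3))
print("attributed to:", attribute(supports, query))
```

With these toy inputs the query is attributed to model_a because it shares the "Certainly! Here is" phrasing with that support set. In practice a detection decision would compare the score against a threshold chosen for a target false-positive rate, which this sketch leaves out.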

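Experimental Results

Research Questions

RQ1: Do style representations learned from human writing generalize to distinguishing human from machine authors across unseen LLMs?
RQ2: How many samples are needed to reliably detect machine-generated text in the few-shot setting?
RQ3: Do style representations trained on multiple domains and multiple LLMs improve detection and model attribution?
RQ4: How robust are style-based detectors to paraphrasing attacks and to multiple target LLMs?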

Key Findings

Method	Training	pAUC	Dataset	N=5	N=10
UAR	Reddit (5M)	0.905 (0.001)	-	0.905	0.981
UAR	Reddit (5M), Twitter, StackExchange	-	-	0.886 (0.001)	0.968 (0.001)
UAR	AAC, Reddit (politics)	-	-	0.877 (0.001)	0.940 (0.0013)
CISR	Reddit (hard neg/hard pos)	-	-	0.839 (0.001)	0.933 (0.0013)
RoBERTa (ProtoNet)	AAC, Reddit (politics)	-	-	0.871 (0.001)	0.9475 (0.0014)
RoBERTa (MAML)	AAC, Reddit (politics)	-	-	0.662 (0.006)	0.685 (0.0068)
SBERT	Multiple	-	-	0.621 (0.002)	0.716 (0.0022)
AI Detector (fine-tuned)	AAC, Reddit (politics)	-	-	0.6510 (0.031)	0.659 (0.032)
AI Detector	WebText, GPT2-XL	-	-	0.603 (0.025)	0.601 (0.0249)
Rank (GPT2-XL)	BookCorpus, WebText	-	-	0.569 (0.015)	0.558 (0.017)
LogRank (GPT2-XL)	BookCorpus, WebText	-	-	0.764 (0.036)	0.775 (0.038)
Entropy (GPT2-XL)	BookCorpus, WebText	-	-	0.4984 (0.0005)	0.4977 (0.0002)
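
The table reports pAUC, the area under the ROC curve restricted to a low false-positive-rate region. This review does not state the FPR cutoff or the normalization used, so the sketch below rests on two assumptions: a default cutoff of max_fpr=0.01, and McClish standardization, which rescales the partial area so that a chance-level detector scores 0.5 and a perfect one scores 1.0. The function names are illustrative, and label 1 is taken to mean machine-generated.

```python
def roc_points(labels, scores):
    """ROC curve as a list of (fpr, tpr) points; label 1 = machine-generated."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Process all examples that share the same score as one threshold step.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def standardized_pauc(labels, scores, max_fpr=0.01):
    """Partial AUC over FPR in [0, max_fpr], McClish-standardized (assumed)."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule
    min_area = 0.5 * max_fpr ** 2  # partial area of the chance diagonal
    max_area = max_fpr             # partial area of a perfect detector
    return 0.5 * (1.0 + (area - min_area) / (max_area - min_area))


# Made-up labels and scores: perfect separation versus an uninformative detector.
perfect = standardized_pauc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], max_fpr=0.5)
chance = standardized_pauc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5], max_fpr=0.5)
print(perfect, chance)
```

Under this standardization, values close to 0.5 correspond to a detector that is no better than chance in the low-FPR region.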

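Style representations enable reliable detection of text from unseen LLMs given only a few examples.
UAR style representations, trained on Reddit (and extended with multi-domain data), outperform baselines including ProtoNet, CISR, SBERT, and zero-shot detectors on pAUC in the low-FPR region.
Including additional LLM-generated data (AAC) and multi-LLM training improves robustness to paraphrasing attacks.
ProtoNet and other metric-based detectors can be less effective in the multi-LLM detection setting, whereas style-based methods maintain strong performance.
The best single-target and multi-target results reach pAUC values of roughly 0.90 or higher under favorable conditions, and remain robust under paraphrase augmentation when the style model is trained on multiple LLMs.
The authors release their dataset and report reproducibility details, suggesting that practical deployment is feasible.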

This review was generated by AI and checked by a human editor.