QUICK REVIEW

[Paper Review] Stylometry Analysis of Human and Machine Text for Academic Integrity

Hezam Albaqami, Muhammad Asif Ayub|arXiv (Cornell University)|Jan 3, 2026

Academic integrity and plagiarism | Cited by 0

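One-Sentence Summary

The paper proposes a stylometry-based NLP framework that embeds Gemini-generated machine text into human-written documents and addresses four tasks: distinguishing machine text from human text, classifying single-author versus multi-author documents, detecting author changes within multi-author documents, and recognizing authors. The datasets and code are publicly available.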

ABSTRACT

This work addresses critical challenges to academic integrity, including plagiarism, fabrication, and verification of authorship of educational content, by proposing a Natural Language Processing (NLP)-based framework for authenticating students' content through author attribution and style change detection. Despite some initial efforts, several aspects of the topic are yet to be explored. In contrast to existing solutions, the paper provides a comprehensive analysis of the topic by targeting four relevant tasks, including (i) classification of human and machine text, (ii) differentiating in single and multi-authored documents, (iii) author change detection within multi-authored documents, and (iv) author recognition in collaboratively produced documents. The solutions proposed for the tasks are evaluated on two datasets generated with Gemini using two different prompts, including a normal and a strict set of instructions. During experiments, some reduction in the performance of the proposed solutions is observed on the dataset generated through the strict prompt, demonstrating the complexities involved in detecting machine-generated text with cleverly crafted prompts. The generated datasets, code, and other relevant materials are made publicly available on GitHub, which are expected to provide a baseline for future research in the domain.

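Motivation and Objectives

Address the academic-integrity challenges raised by AI-generated content (plagiarism, authorship verification).
Propose a stylometry-based framework covering four tasks: machine vs. human text, single-author vs. multi-author classification, author change detection, and author recognition.
Create large-scale benchmark datasets by embedding Gemini-generated machine text into human-authored documents.
Release the datasets and code publicly to establish a baseline for future stylometry research in education.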

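Proposed Method

Datasets are created by generating machine text under two prompts (normal and strict) and embedding it into human-authored documents.
Data preprocessing includes null/duplicate detection and class-weight-based balancing.
Four transformer models are used for text classification (BERT-base, ALBERT, DistilBERT, RoBERTa), combined with fine-tuning and dropout configurations.
Four evaluation tasks: machine vs. human text, single-author vs. multi-author documents, author change detection, and author recognition (multi-label).
Hyperparameters and training settings include a batch size of 32, gradient accumulation over 2 steps, a learning rate of 1e-5 with cosine decay and warmup steps, weight decay of 0.01, 5 training epochs, fp16, and a 70/15/15 train/validation/test split.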
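
The training settings listed above can be made concrete with a small standard-library sketch. It is not the authors' code: the warmup step count, the inverse-frequency weighting formula, and all function names are assumptions introduced for illustration, since the summary gives only the headline values (batch size 32, accumulation x2, learning rate 1e-5, 5 epochs, 70/15/15 split).

```python
import math
import random
from collections import Counter

# Values stated in the summary.
BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 2
PEAK_LR = 1e-5
EPOCHS = 5
# Assumption: the summary mentions warmup steps but gives no count.
WARMUP_STEPS = 100


def split_70_15_15(items, seed=0):
    """Shuffle and split into train/validation/test at 70/15/15."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_val = len(items) * 15 // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: the exact weighting scheme used in the paper is not given.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def lr_at(step, total_steps, peak=PEAK_LR, warmup=WARMUP_STEPS):
    """Linear warmup to `peak`, then cosine decay to zero."""
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


def optimizer_steps(n_train, batch=BATCH_SIZE,
                    accum=GRAD_ACCUM_STEPS, epochs=EPOCHS):
    """Number of optimizer updates once gradient accumulation is applied."""
    batches_per_epoch = math.ceil(n_train / batch)
    return math.ceil(batches_per_epoch / accum) * epochs
```

With a batch size of 32 and accumulation over 2 steps, each optimizer update sees an effective batch of 64 examples, which is why `optimizer_steps` divides the per-epoch batch count by the accumulation factor.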

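Experimental Results

Research Questions

RQ1: Can machine text be reliably distinguished from human-written text across different prompts and datasets?
RQ2: When machine text is embedded into human work, can single-author and multi-author documents be classified accurately?
RQ3: In multi-author documents, which paragraphs involve an author change?
RQ4: Can the individual authors of a multi-author document be recognized (including treating the AI as an author)?
RQ5: How do prompt design and the embedding strategy affect the performance of stylometric detection?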
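
To make the task framing behind RQ1 through RQ4 concrete, here is a minimal sketch of how the four label types could be derived from one document whose paragraphs carry author tags. The `Paragraph` structure, the `"machine"` tag, the paragraph-level granularity, and counting the machine as an author are all assumptions for illustration; the summary does not describe the dataset schema.

```python
from dataclasses import dataclass
from typing import List

MACHINE = "machine"  # hypothetical tag for Gemini-generated paragraphs


@dataclass
class Paragraph:
    text: str
    author: str  # hypothetical tag, e.g. "author_1" or "machine"


def machine_vs_human(doc: List[Paragraph]) -> List[int]:
    """RQ1: 1 for machine-written paragraphs, 0 for human-written ones."""
    return [int(p.author == MACHINE) for p in doc]


def is_multi_author(doc: List[Paragraph]) -> int:
    """RQ2: 1 if more than one distinct author (machine included) contributed."""
    return int(len({p.author for p in doc}) > 1)


def author_changes(doc: List[Paragraph]) -> List[int]:
    """RQ3: 1 at each paragraph boundary where the author changes."""
    return [int(a.author != b.author) for a, b in zip(doc, doc[1:])]


def author_multi_hot(doc: List[Paragraph], authors: List[str]) -> List[int]:
    """RQ4: multi-label vector marking which known authors appear."""
    present = {p.author for p in doc}
    return [int(a in present) for a in authors]


# Toy document: a human text with one machine paragraph embedded.
doc = [
    Paragraph("Intro written by a student.", "author_1"),
    Paragraph("Paragraph produced by the model.", MACHINE),
    Paragraph("More text by the same student.", "author_1"),
    Paragraph("A co-author's contribution.", "author_2"),
]
known_authors = ["author_1", "author_2", "author_3", MACHINE]

print(machine_vs_human(doc))
print(is_multi_author(doc))
print(author_changes(doc))
print(author_multi_hot(doc, known_authors))
```

A document with n paragraphs yields n labels for RQ1 but only n - 1 labels for RQ3, because author changes are defined on the boundaries between consecutive paragraphs.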

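Key Findings

Machine vs. human text classification reaches near-perfect accuracy (close to 0.999) across all models and prompts.
Single-author vs. multi-author classification is also highly accurate and robust to the embedded machine text.
Author change detection remains challenging, with F1-scores of roughly 0.68 to 0.70 under both the normal and strict prompts.
Author recognition in multi-author documents yields a low overall F1-score with pronounced class imbalance: some authors are detected reliably (e.g., Author 1), while others are detected poorly (e.g., Authors 2 to 4).
Prompt design (normal vs. strict) affects the detectability of machine text, with some performance reduction reported on the strict-prompt dataset, as well as the intra-class/inter-class separability of the vectors after embedding.
The public availability of the datasets and code (GitHub) provides a baseline for future stylometry research on academic integrity.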
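
The gap between a reliably detected author and poorly detected ones is easiest to see when F1 is computed per label and then macro-averaged. The sketch below shows that computation for multi-label predictions; the toy label matrices are invented for illustration and are not results from the paper, and the function names are hypothetical.

```python
from typing import List, Sequence


def per_label_f1(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> List[float]:
    """F1 per label column, using F1 = 2*TP / (2*TP + FP + FN)."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Convention: a label with no positives and no predictions scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(scores: Sequence[float]) -> float:
    """Unweighted mean of per-label F1, so rare labels count equally."""
    return sum(scores) / len(scores)


# Invented toy data: column 0 is a frequent author, column 1 a rare one.
y_true = [[1, 1], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 1]]

scores = per_label_f1(y_true, y_pred)
print(scores, macro_f1(scores))
```

In this toy case the frequent label is predicted perfectly while one miss on the rare label pulls its F1 down to 2/3, so the macro average drops even though most individual predictions are correct.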

Figure 2: Flowchart of the Data generation process. The same flowchart is used for both datasets, with only differences in the instruction sets.


This review was generated by AI and reviewed by a human editor.