QUICK REVIEW

[Paper Review] The Gray Area: Characterizing Moderator Disagreement on Reddit

Shayan Alipour, Shruti Phadke|arXiv (Cornell University)|Jan 4, 2026

Hate Speech and Cyberbullying Detection | Cited by 0

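One-Sentence Summary

The paper analyzes gray-area moderation on Reddit, showing that about one in seven cases is disputed among moderators, that automated actions are often reversed and corrected by humans, and that large language models perform worse on gray-area content than on undisputed cases.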

ABSTRACT

Volunteer moderators play a crucial role in sustaining online dialogue, but they often disagree about what should or should not be allowed. In this paper, we study the complexity of content moderation with a focus on disagreements between moderators, which we term the "gray area" of moderation. Leveraging 5 years and 4.3 million moderation log entries from 24 subreddits of different topics and sizes, we characterize how gray area, or disputed cases, differ from undisputed cases. We show that one-in-seven moderation cases are disputed among moderators, often addressing transgressions where users' intent is not directly legible, such as in trolling and brigading, as well as tensions around community governance. This is concerning, as almost half of all gray area cases involved automated moderation decisions. Through information-theoretic evaluations, we demonstrate that gray area cases are inherently harder to adjudicate than undisputed cases and show that state-of-the-art language models struggle to adjudicate them. We highlight the key role of expert human moderators in overseeing the moderation process and provide insights about the challenges of current moderation processes and tools.

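Research Motivation and Objectives

Quantify how often moderation cases are disputed among moderators (the gray area) relative to undisputed cases.
Break down gray-area characteristics by whether human or automated moderators are involved.
Examine moderators' rationales, content characteristics, and the dynamics that lead to disputes.
Assess the feasibility and limits of LLM-based adjudication of gray-area cases.
Provide insights for improving moderator tools and governance transparency.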

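Proposed Method

Analyze the OpenModLog longitudinal dataset, which contains 4.3 million moderation actions from 24 subreddits over 5 years.
Define a gray-area case as one with two or more conflicting moderation actions across multiple moderators, treating the last action as the ground-truth outcome.
Partition cases into four strata: gray-human, gray-bot, undisputed-human, and undisputed-bot.
Annotate moderator rationales using Bies's and related taxonomies; map free-text reasons to 472 clusters and seven thematic categories.
Apply BERTopic topic modeling to debiased comment text to identify gray-area content themes.
Use pointwise V-information (PVI) to assess how hard the text is to adjudicate, and benchmark six LLMs on their alignment with the final moderation decision.
Evaluate the LLMs on each stratum under deterministic prompting, reporting macro-F1 with bootstrap confidence intervals.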
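
The gray-area definition and the four strata above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Action` record, its field names, and the rule that a single automated moderator marks a case as "bot" are assumptions made for illustration. The disagreement test follows the Figure 1 caption (more than one unique moderator and more than one unique action).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One moderation log entry (hypothetical schema, for illustration only)."""
    moderator: str  # moderator account name
    action: str     # e.g. "approve" or "remove"
    is_bot: bool    # True if the moderator is an automated account


def stratum(actions):
    """Assign one case (all actions on a single item, in time order) to a stratum."""
    moderators = {a.moderator for a in actions}
    outcomes = {a.action for a in actions}
    # Disagreement: more than one unique moderator AND more than one unique action.
    disputed = len(moderators) > 1 and len(outcomes) > 1
    # Assumption: the case is a "bot" case if any involved moderator is automated.
    any_bot = any(a.is_bot for a in actions)
    dispute_part = "gray" if disputed else "undisputed"
    actor_part = "bot" if any_bot else "human"
    return f"{dispute_part}-{actor_part}"


def final_label(actions):
    """The last action in the sequence is treated as the ground-truth outcome."""
    return actions[-1].action
```

For example, a case in which an automated moderator removes a comment and a human later approves it would fall into the gray-bot stratum, with "approve" as its final label.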

Figure 1: (a) An example showing multiple moderation actions on a single comment. (b) Cases are partitioned into four mutually exclusive strata by (i) whether there is within-case disagreement defined as more than one unique moderator and more than one unique action and (ii) whether any moderator is

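Experimental Results

Research Questions

RQ1: Across multiple subreddits, how prevalent are gray-area moderation cases on Reddit, and what are they composed of?
RQ2: How do gray-area cases differ between human and automated moderation?
RQ3: Which rationales and content characteristics drive gray-area disputes?
RQ4: Can current LLMs align with final moderator decisions on gray-area content?
RQ5: Which factors affect LLM performance on gray-area versus undisputed moderation decisions?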

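Key Findings

Gray-area moderation is common, accounting for about 13.54% of all moderation actions (n = 578,251 of 4,272,178).
Among gray-area cases, 53.86% involve only human moderators and 46.14% include at least one bot moderator.
Initial bot actions are predominantly removals (95.49%), whereas human patterns are more balanced (68.67% approvals, 31.33% removals).
In 93.46% of bot→human sequences, an initial bot removal is subsequently approved by a human, suggesting that initial over-moderation by bots is corrected by humans.
Experienced human moderators tend to overturn the decisions of less experienced peers; overturning moderators have about 51 more days of tenure on average (95% CI [37, 65]).
LLMs align better with final decisions in the undisputed-human (macro-F1 about 0.61) and gray-bot (about 0.59) strata than in the gray-human and undisputed-bot strata (about 0.51 each).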
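
As a rough illustration of how a per-stratum macro-F1 and its bootstrap confidence interval might be computed, here is a minimal pure-Python sketch. It is an assumption-laden stand-in, not the paper's evaluation code: the function names, the percentile bootstrap, the number of resamples, and the fixed seed are all illustrative choices.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro-F1 (resampling cases with replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Here `y_true` would hold the final moderator decisions for one stratum and `y_pred` the LLM's decisions for the same cases; running both functions per stratum gives a point estimate and an interval that can be compared across strata.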

Figure 2: Share of moderation cases by stratum (gray vs. undisputed) across rule categories. Gray-area cases are relatively overrepresented in trolling, brigading, and doxxing, while spam, link-only, and formatting violations make up a larger share of undisputed cases.


This review was generated by AI and reviewed by human editors.