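[Paper Summary] Some Like it Hoax: Automated Fake News Detection in Social Networks
This paper shows that hoaxes can be identified from the set of users who liked a post. Using logistic regression and a harmonic boolean label crowdsourcing (BLC) method, it reaches over 99% accuracy on a Facebook dataset with only a small amount of labeled data.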
In recent years, the reliability of information on the Internet has emerged as a crucial issue of modern society. Social network sites (SNSs) have revolutionized the way in which information is spread by allowing users to freely share content. As a consequence, SNSs are also increasingly used as vectors for the diffusion of misinformation and hoaxes. The amount of disseminated information and the rapidity of its diffusion make it practically impossible to assess reliability in a timely manner, highlighting the need for automatic hoax detection systems. As a contribution towards this objective, we show that Facebook posts can be classified with high accuracy as hoaxes or non-hoaxes on the basis of the users who "liked" them. We present two classification techniques, one based on logistic regression, the other on a novel adaptation of boolean crowdsourcing algorithms. On a dataset consisting of 15,500 Facebook posts and 909,236 users, we obtain classification accuracies exceeding 99% even when the training set contains less than 1% of the posts. We further show that our techniques are robust: they work even when we restrict our attention to the users who like both hoax and non-hoax posts. These results suggest that mapping the diffusion pattern of information can be a useful component of automatic hoax detection systems.
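Motivation and Goals
- Motivate automatic hoax detection, given how quickly misinformation spreads on social networks.
- Investigate whether the set of users who like a post reveals whether it is a hoax.
- Develop two classifiers based on user-post interaction data.
- Evaluate how well the methods scale and transfer across pages and communities.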
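Proposed Methods
- Represent each post as a binary vector over the users who liked it, and apply logistic regression to learn per-user weights that predict hoax versus non-hoax.
- Adapt boolean label crowdsourcing (the harmonic algorithm) to a setting with a labeled training set: treat each like as a positive vote and update alpha/beta parameters to infer whether a post is legitimate.
- Build a bipartite graph of posts and users and run iterative updates that propagate information from labeled posts to unlabeled ones.
- In logistic regression, the weight w_u encodes each user's tendency to like non-hoax versus hoax posts.
- In harmonic BLC, initialize the posts known to be hoaxes and known to be non-hoaxes, then iteratively update the beliefs about users and posts through alpha/beta counts.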
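The two methods listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the learning rate, the number of iterations, the symmetric prior of 1.0, and the sign conventions (a positive weight or a negative score meaning "hoax") are all choices made here for the example. The harmonic update follows the general alpha/beta scheme described in the list: positive evidence is added to alpha, negative evidence to beta, and the belief is the normalized difference.

```python
import math

def fit_logistic(likes, labels, lr=0.5, epochs=200):
    """Learn one weight per user by stochastic gradient descent.

    likes:  dict post_id -> set of user_ids who liked the post
    labels: dict post_id -> True if the post is a known hoax, else False
    Assumed convention: a positive weight marks a hoax-leaning user.
    """
    w, b = {}, 0.0
    for _ in range(epochs):
        for post, is_hoax in labels.items():
            z = b + sum(w.get(u, 0.0) for u in likes[post])
            grad = 1.0 / (1.0 + math.exp(-z)) - (1.0 if is_hoax else 0.0)
            b -= lr * grad
            for u in likes[post]:
                w[u] = w.get(u, 0.0) - lr * grad
    return w, b

def hoax_probability(w, b, users):
    """Predicted probability that a post liked by `users` is a hoax."""
    z = b + sum(w.get(u, 0.0) for u in users)
    return 1.0 / (1.0 + math.exp(-z))

def harmonic_blc(likes, labels, n_iter=5, prior=1.0):
    """Return a score in [-1, 1] per post; negative means 'likely hoax'."""
    liked_by = {}
    for post, users in likes.items():
        for u in users:
            liked_by.setdefault(u, []).append(post)
    # Labeled posts are pinned at -1 (hoax) or +1 (non-hoax); others start at 0.
    q_post = {p: 0.0 for p in likes}
    for p, is_hoax in labels.items():
        q_post[p] = -1.0 if is_hoax else 1.0
    q_user = {}
    for _ in range(n_iter):
        # Update user beliefs from the posts they liked.
        for u, posts in liked_by.items():
            alpha = prior + sum(q_post[p] for p in posts if q_post[p] > 0)
            beta = prior - sum(q_post[p] for p in posts if q_post[p] < 0)
            q_user[u] = (alpha - beta) / (alpha + beta)
        # Update unlabeled post beliefs from the users who liked them.
        for p, users in likes.items():
            if p in labels:
                continue
            alpha = prior + sum(q_user[u] for u in users if q_user[u] > 0)
            beta = prior - sum(q_user[u] for u in users if q_user[u] < 0)
            q_post[p] = (alpha - beta) / (alpha + beta)
    return q_post

# Toy data (hypothetical): two labeled posts and two unlabeled ones.
likes = {"h1": {"a", "b"}, "n1": {"c", "d"}, "x": {"a", "b"}, "y": {"c", "d"}}
labels = {"h1": True, "n1": False}
w, b = fit_logistic(likes, labels)
scores = harmonic_blc(likes, labels)
print(hoax_probability(w, b, likes["x"]), scores["x"])
```

On this toy data, post `x` shares its likers with the labeled hoax and post `y` with the labeled non-hoax, so both sketches push `x` toward "hoax" and `y` away from it. Note that the logistic model can only score users it saw during training, whereas the iterative scheme also reaches users connected to labeled posts through intermediate unlabeled posts.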
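Experimental Results
Research Questions
- RQ1: Can hoaxes be identified from the set of users who interacted with (liked) a post?
- RQ2: How does the size of the manually labeled training set affect classification accuracy?
- RQ3: How well do the methods transfer information across Facebook pages (communities)?
- RQ4: How robust are the methods on a more mixed user community (the intersection dataset)?
Key Findings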
| Experiment | One-page-out Avg accuracy | One-page-out Stdev | Half-pages-out Avg accuracy | Half-pages-out Stdev |
|---|---|---|---|---|
| Logistic regression | 0.794 | 0.303 | 0.716 | 0.143 |
| Harmonic BLC | 0.991 | 0.023 | 0.993 | 0.002 |
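The summary does not spell out how the two evaluation protocols in the table are built, so the following is only a sketch under assumptions: "one-page-out" is taken to mean holding out all posts of one page at a time and training on the rest, and "half-pages-out" to mean holding out a random half of the pages. The function names, the `post_page` mapping, and the use of the population standard deviation are hypothetical choices for illustration.

```python
import random
import statistics

def one_page_out(post_page):
    """Yield (held_out_page, train_posts, test_posts), one fold per page."""
    pages = sorted(set(post_page.values()))
    for held_out in pages:
        train = [p for p, pg in post_page.items() if pg != held_out]
        test = [p for p, pg in post_page.items() if pg == held_out]
        yield held_out, train, test

def half_pages_out(post_page, seed=0):
    """Hold out a random half of the pages; return (train_posts, test_posts)."""
    pages = sorted(set(post_page.values()))
    random.Random(seed).shuffle(pages)
    held_out = set(pages[: len(pages) // 2])
    train = [p for p, pg in post_page.items() if pg not in held_out]
    test = [p for p, pg in post_page.items() if pg in held_out]
    return train, test

def summarize(accuracies):
    """Average and (population) standard deviation over folds or runs."""
    return statistics.mean(accuracies), statistics.pstdev(accuracies)

# Hypothetical mapping from post id to the page that published it.
post_page = {"p1": "page_A", "p2": "page_A", "p3": "page_B",
             "p4": "page_C", "p5": "page_D"}
folds = list(one_page_out(post_page))
train_half, test_half = half_pages_out(post_page)
```

In both splits no page contributes posts to both the training and the test set, which is what makes the accuracy a measure of transfer across pages rather than within them.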
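- Both methods reach high accuracy on the complete dataset, exceeding 99% even with a very small training set.
- Harmonic BLC stays near-perfect when transferring across pages (about 99% or higher), learning from other pages even when labeled data is limited.
- On the intersection dataset, logistic regression outperforms harmonic BLC when training data is scarce, reaching about 90% accuracy with 10% of the data used for training.
- Harmonic BLC needs labels for only about 0.5% of the posts (roughly 80 posts) to exceed 99% accuracy on the complete dataset.
- The methods are robust to polarization between hoax and non-hoax pages and to overlap between their user communities.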
This summary was generated by AI and reviewed by a human editor.