QUICK REVIEW

[Paper Review] Online Human-Bot Interactions: Detection, Estimation, and Characterization

Onur Varol, Emilio Ferrara|arXiv (Cornell University)|Mar 9, 2017

Spam and Phishing Detection|Cited by 236

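One-Sentence Summary

A framework uses 1,150 features drawn from public data to detect Twitter bots, achieves high accuracy, and estimates that bots make up between 9% and 15% of active accounts.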

ABSTRACT

Increasing evidence suggests that a growing amount of social media content is generated by autonomous entities known as social bots. In this work we present a framework to detect such entities on Twitter. We leverage more than a thousand features extracted from public data and meta-data about users: friends, tweet content and sentiment, network patterns, and activity time series. We benchmark the classification framework by using a publicly available dataset of Twitter bots. This training data is enriched by a manually annotated collection of active Twitter users that include both humans and bots of varying sophistication. Our models yield high accuracy and agreement with each other and can detect bots of different nature. Our estimates suggest that between 9% and 15% of active Twitter accounts are bots. Characterizing ties among accounts, we observe that simple bots tend to interact with bots that exhibit more human-like behaviors. Analysis of content flows reveals retweet and mention strategies adopted by bots to interact with different target groups. Using clustering analysis, we characterize several subclasses of accounts, including spammers, self promoters, and accounts that post content from connected applications.

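Motivation and Objectives

Build a scalable framework that detects social bot accounts on Twitter using a broad range of public data and metadata.
Evaluate detection accuracy across datasets, models, and increasingly sophisticated bots.
Estimate the prevalence of bot-like accounts in a large population of English-speaking Twitter users.
Characterize the social connectivity, information flows, and behavioral clusters of human and bot-like accounts.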

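Proposed Method

Extract 1,150 features in six feature classes, covering user metadata, content, network structure, and timing.
Train supervised machine learning classifiers (random forest, AdaBoost, logistic regression, decision tree) and select the best one by AUC; random forest reaches 0.95 AUC.
Use a manually annotated set of bot and human accounts to test generalization and to extend the training data.
Evaluate the models on the honeypot bot data and on recently hand-labeled accounts, assessing cross-dataset performance and threshold selection.
Compute the bot-score threshold by maximizing classification accuracy across the bot-score deciles of the manually annotated data.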
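
The classifier selection and thresholding steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `auc`, `accuracy`, and `best_threshold`, the toy scores, and the candidate threshold grid are all invented for the example. AUC is computed here as the probability that a randomly chosen bot scores higher than a randomly chosen human, with ties counting half, and the threshold is the candidate value that maximizes accuracy on labeled data.

```python
from itertools import product


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive (bot, label 1) receives
    a higher score than a random negative (human, label 0); ties count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one bot and one human")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold):
    """Fraction classified correctly when score >= threshold means 'bot'."""
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest accuracy.

    On ties, the first (lowest-listed) candidate wins.
    """
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


# Hypothetical toy data: bot scores in [0, 1] and labels (1 = bot, 0 = human).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model_scores = {
    "random_forest": [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9],
    "decision_tree": [0.5, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.9],
}

# Pick the model with the highest AUC, then tune its decision threshold.
best_model = max(model_scores, key=lambda m: auc(model_scores[m], labels))
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
threshold = best_threshold(model_scores[best_model], labels, candidates)
```

The pairwise AUC is quadratic in the number of accounts, which is fine for a sketch; a rank-based computation or a library routine would be the usual choice at scale.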

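Experimental Results

Research Questions

RQ1: Can a supervised model driven by a large feature set accurately distinguish bots from humans on Twitter?
RQ2: How has bot sophistication evolved, and how does that affect model performance across datasets?
RQ3: What is the prevalence of bot-like accounts in a large population of English-speaking Twitter users?
RQ4: What social connectivity patterns and information flows exist between humans and bot-like accounts?
RQ5: Which behavioral clusters emerge among accounts, and what characterizes each cluster?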

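Key Findings

The large-scale feature framework achieves high detection performance; random forest reaches 0.95 AUC on the honeypot data.
On the manually annotated data, accuracy exceeds 90% in the low bot-score deciles and falls to 60% to 80% in the challenging middle range; overall accuracy, weighted by population, is 86%.
The estimated bot prevalence lies between 9% and 15%, depending on the training data and the threshold chosen.
Humans mainly follow humans and are followed by humans and some sophisticated bots, whereas bots prefer to interact with other bots and show lower reciprocity.
Clustering reveals ten behavioral groups; notable ones include recruitment/spam accounts, accounts that post from connected applications, and mixed bot/human (cyborg) accounts.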
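
The population-weighted accuracy mentioned above can be illustrated with a short calculation. This is a sketch under stated assumptions: the helper `weighted_accuracy` is a hypothetical name, and the per-decile accuracies and account counts are made-up values chosen only to show why a weighted figure can differ from a plain average. They are not the paper's data.

```python
def weighted_accuracy(per_bin_accuracy, bin_population):
    """Average per-bin accuracy, weighting each bin by its population."""
    if len(per_bin_accuracy) != len(bin_population):
        raise ValueError("need one population count per accuracy value")
    total = sum(bin_population)
    if total <= 0:
        raise ValueError("total population must be positive")
    weighted_sum = sum(a * n for a, n in zip(per_bin_accuracy, bin_population))
    return weighted_sum / total


# Hypothetical accuracy in each bot-score decile (lowest scores first):
# high at the extremes, lower in the ambiguous middle range.
decile_accuracy = [0.95, 0.92, 0.85, 0.75, 0.65, 0.60, 0.70, 0.80, 0.90, 0.95]

# Hypothetical number of accounts whose score falls in each decile;
# most accounts sit in the low-score deciles.
decile_population = [400, 200, 100, 60, 40, 40, 40, 40, 40, 40]

overall = weighted_accuracy(decile_accuracy, decile_population)
unweighted = sum(decile_accuracy) / len(decile_accuracy)
```

With these made-up numbers the weighted figure is higher than the unweighted mean, because most accounts fall in the low-score deciles where classification is easier.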

This review was generated by AI and reviewed by a human editor.