QUICK REVIEW

[Paper Review] Entropy-based Classification of 'Retweeting' Activity on Twitter

Rumi Ghosh, Tawan Surachawala|arXiv (Cornell University)|Jun 2, 2011

Spam and Phishing Detection|References 17|Cited by 61

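One-Sentence Summary

This paper proposes an entropy-based, content-independent method that classifies retweeting activity on Twitter using two features: time-interval entropy and user entropy. The method separates five types of activity (news dissemination, advertising and promotion, campaigns, robotic activity, and parasitic advertising), enabling scalable spam detection and trend analysis that depend on neither content nor language.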

ABSTRACT

Twitter is used for a variety of reasons, including information dissemination, marketing, political organizing and to spread propaganda, spamming, promotion, conversations, and so on. Characterizing these activities and categorizing associated user generated content is a challenging task. We present an information-theoretic approach to classification of user activity on Twitter. We focus on tweets that contain embedded URLs and study their collective 'retweeting' dynamics. We identify two features, time-interval and user entropy, which we use to classify retweeting activity. We achieve good separation of different activities using just these two features and are able to categorize content based on the collective user response it generates. We have identified five distinct categories of retweeting activity on Twitter: automatic/robotic activity, newsworthy information dissemination, advertising and promotion, campaigns, and parasitic advertisement. In the course of our investigations, we have shown how Twitter can be exploited for promotional and spam-like activities. The content-independent, entropy-based activity classification method is computationally efficient, scalable and robust to sampling and missing data. It has many applications, including automatic spam-detection, trend identification, trust management, user-modeling, social search and content classification on online social media.

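Research Motivation and Objectives

To address the challenge of classifying the diverse and complex user activity on Twitter, such as spam, propaganda, and organic information dissemination.
To develop a content- and language-independent method that classifies activity from the dynamics of collective user response.
To identify human-driven retweeting and distinguish it from automated or bot-driven activity.
To enable practical applications such as spam detection, trust management, and content classification on online social media platforms.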

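Proposed Method

Uses URLs as markers to trace how content spreads and to identify retweets, whether or not a tweet contains "RT" or credits the original poster.
Characterizes retweeting dynamics through two distributions: the time intervals between successive retweets, and the distinct users who take part.
Applies Shannon entropy to quantify the uncertainty, or randomness, in the time-interval distribution and the user distribution.
Uses the entropy of each distribution as a key feature for classifying retweeting behavior into categories.
Relies only on observed patterns of user response, avoiding any dependence on content, language, or explicit user ratings.
Trains a classifier on the resulting feature space to separate activity into meaningful, empirically validated categories.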
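
The two entropy features can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `shannon_entropy` and `retweet_entropies`, the `(user_id, timestamp)` input format, the `bin_seconds` resolution used to discretize intervals, and the choice of base-2 logarithms are all assumptions made for the example, and the toy data is invented.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    # log2(total / c) equals -log2(p), which keeps the result non-negative.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def retweet_entropies(retweets, bin_seconds=1):
    """Return (time_interval_entropy, user_entropy) for one URL.

    retweets: list of (user_id, unix_timestamp) pairs for tweets sharing the URL.
    bin_seconds: hypothetical resolution used to discretize time intervals.
    """
    ordered = sorted(retweets, key=lambda r: r[1])
    times = [t for _, t in ordered]
    # Gaps between successive retweets, rounded down to the chosen resolution.
    intervals = [int((b - a) // bin_seconds) for a, b in zip(times, times[1:])]
    interval_counts = Counter(intervals)
    # Number of retweets contributed by each distinct user.
    user_counts = Counter(user for user, _ in ordered)
    return (shannon_entropy(interval_counts.values()),
            shannon_entropy(user_counts.values()))

# Invented toy data: one account posting every 60 s, and five accounts
# posting at irregular times.
bot_like = [("a", 0), ("a", 60), ("a", 120), ("a", 180), ("a", 240)]
human_like = [("a", 0), ("b", 5), ("c", 47), ("d", 300), ("e", 1000)]

print(retweet_entropies(bot_like))
print(retweet_entropies(human_like))
```

In this sketch the periodic single-account sequence yields zero for both entropies, while the irregular multi-user sequence yields higher values for both. Each URL's pair of values would then serve as a point in the two-dimensional feature space passed to a classifier.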

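Experimental Results

Research Questions

RQ1: How can retweeting dynamics be characterized quantitatively so as to distinguish different types of user activity on Twitter?
RQ2: Can entropy-based features effectively distinguish human-driven retweeting from automated or bot-driven activity?
RQ3: To what extent can content-independent dynamical features classify retweeting behavior into meaningful categories?
RQ4: How do time-interval entropy and user entropy differ across activities such as news, advertising, and spam?
RQ5: Can the method detect sophisticated spam and promotional activity that evades traditional content-based filtering?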

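Key Findings

The entropy-based method separates retweeting activity into five categories: newsworthy information dissemination, advertising and promotion, campaigns, automatic/robotic activity, and parasitic advertising.
Automated retweeting has markedly lower time-interval entropy, which clearly separates it from human-driven activity.
User entropy captures the diversity of participating users, distinguishing widely disseminated news from targeted or repetitive promotion.
The method is robust to sampling and missing data, and requires no content analysis or language processing.
Several accounts that the model identified as spam-like were later suspended by Twitter, supporting the method's real-world detection ability.
The method automatically detects newsworthy content with high accuracy and separates it from low-value or promotional content, regardless of language or content type.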
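
A few limiting cases help show why two entropy values can separate these behaviors. The definitions below are an assumption, since this summary does not reproduce the paper's formulas: entropy is taken over empirical frequencies, with $p(\Delta t_i)$ the fraction of inter-retweet intervals of length $\Delta t_i$ and $p(f_j)$ the fraction of retweets made by user $f_j$, for a URL with $N$ retweets.

```latex
\begin{align*}
H_T &= -\sum_{i} p(\Delta t_i)\,\log p(\Delta t_i),
&
H_F &= -\sum_{j} p(f_j)\,\log p(f_j) \\[4pt]
% Perfectly periodic posting: a single interval value, so p = 1
\text{periodic posting:}\quad H_T &= -1\cdot\log 1 = 0 \\
% N retweets from one account: p = 1
\text{one account, } N \text{ retweets:}\quad H_F &= -1\cdot\log 1 = 0 \\
% N retweets from N distinct users: p(f_j) = 1/N for every j
N \text{ distinct users:}\quad H_F &= -\sum_{j=1}^{N}\frac{1}{N}\log\frac{1}{N} = \log N
\end{align*}
```

Under these assumptions, strictly scheduled automated posting sits at $H_T = 0$, a single account reposting a link sits at $H_F = 0$, and content retweeted once each by many different people approaches the maximum user entropy $\log N$.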

This review was generated by AI and reviewed by a human editor.