QUICK REVIEW

[Paper Review] The Pulse of News in Social Media: Forecasting Popularity

Roja Bandari, Sitaram Asur|arXiv (Cornell University)|Feb 2, 2012

Complex Network Analysis Techniques | References: 28 | Cited by: 110

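One-Sentence Summary

This paper proposes a content-based method for predicting a news article's popularity on Twitter before it is published, using features such as source, category, subjectivity, and named entities. With machine learning methods, it reaches 84% accuracy when classifying articles into popularity ranges (low, medium, or high retweet counts). The article's source is the most important predictor, which highlights the gap between traditional news organizations and the top viral sources on social media.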

ABSTRACT

News articles are extremely time sensitive by nature. There is also intense competition among news items to propagate as widely as possible. Hence, the task of predicting the popularity of news items on the social web is both interesting and challenging. Prior research has dealt with predicting eventual online popularity based on early popularity. It is most desirable, however, to predict the popularity of items prior to their release, fostering the possibility of appropriate decision making to modify an article and the manner of its publication. In this paper, we construct a multi-dimensional feature space derived from properties of an article and evaluate the efficacy of these features to serve as predictors of online popularity. We examine both regression and classification algorithms and demonstrate that despite randomness in human behavior, it is possible to predict ranges of popularity on twitter with an overall 84% accuracy. Our study also serves to illustrate the differences between traditionally prominent sources and those immensely popular on the social web.

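Research Motivation and Objectives

Predict a news article's online popularity on Twitter using only content-level features, before the article is published.
Determine whether early popularity signals are required for prediction, or whether content features alone are sufficient.
Identify which article-level features (source, category, subjectivity, named entities) best predict viral spread on social media.
Compare the influence of traditional news organizations with that of emerging social media influencers in spreading content.
Assess the feasibility of predicting popularity ranges, rather than exact retweet counts, from content features alone.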

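Proposed Method

Build a multi-dimensional feature space from four content-based features: news source, news category, subjectivity of language, and named entities.
Assign each feature a numerical score using predefined scoring functions based on linguistic analysis and metadata.
Apply regression and classification models (SVM, decision trees, ensemble methods, Naive Bayes) to predict popularity ranges on Twitter.
Evaluate model performance with 10-fold cross-validation to obtain robust estimates.
Run an ablation study that removes one feature at a time to measure each feature's individual contribution.
Use the same feature set for a binary classification task that predicts whether an article will receive zero or non-zero retweets.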
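
The cross-validation and ablation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the small categorical Naive Bayes classifier, the helper names (`train_nb`, `predict_nb`, `cross_val_accuracy`, `ablation`), and the synthetic articles are all assumptions made for this example. The toy labels are constructed so that only `source` carries signal, which mirrors the reported finding rather than reproducing it.

```python
from collections import Counter, defaultdict
import math

FEATURES = ["source", "category", "subjectivity", "entities"]


def train_nb(rows, labels, features):
    """Fit a categorical Naive Bayes model by counting feature values per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, label) -> value frequencies
    vocab = defaultdict(set)             # feature -> values seen in training
    for row, label in zip(rows, labels):
        for f in features:
            value_counts[(f, label)][row[f]] += 1
            vocab[f].add(row[f])
    return class_counts, value_counts, vocab


def predict_nb(model, row, features):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):  # sorted for deterministic tie-breaking
        score = math.log(class_counts[label] / total)
        for f in features:
            # Add-one smoothing, with one extra slot for unseen values.
            num = value_counts[(f, label)][row[f]] + 1
            den = class_counts[label] + len(vocab[f]) + 1
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_val_accuracy(rows, labels, features, k=10):
    """Accuracy pooled over k held-out folds (fold i holds out items i, i+k, ...)."""
    n = len(rows)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        train_labels = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train_nb(train_rows, train_labels, features)
        correct += sum(
            predict_nb(model, rows[i], features) == labels[i] for i in test_idx
        )
    return correct / n


def ablation(rows, labels, features=FEATURES, k=10):
    """Score the full feature set, then each set with one feature removed."""
    results = {"all": cross_val_accuracy(rows, labels, features, k)}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        results["without " + dropped] = cross_val_accuracy(rows, labels, kept, k)
    return results


# Synthetic articles (made up for this sketch): the label depends on source only.
rows = [
    {
        "category": ("technology", "politics")[i % 2],
        "source": ("tech_blog", "newswire")[(i // 2) % 2],
        "subjectivity": ("subjective", "objective")[(i // 4) % 2],
        "entities": ("many", "few")[(i // 8) % 2],
    }
    for i in range(40)
]
labels = ["high" if r["source"] == "tech_blog" else "low" for r in rows]

results = ablation(rows, labels)
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

On this synthetic data, dropping `source` lowers the cross-validated accuracy because the labels were built from it. With real articles, the size of each drop is what would indicate a feature's contribution.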

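Experimental Results

Research Questions

RQ1: Can the popularity of news on Twitter be predicted from pre-publication content features alone, without relying on early engagement signals?
RQ2: Which content feature (source, category, subjectivity, or named entities) best predicts a news article's popularity range on Twitter?
RQ3: How does the predictive power of an article's source differ between traditional news organizations and social-media-savvy blogs?
RQ4: To what extent does subjectivity of language affect the likelihood that an article is retweeted on Twitter?
RQ5: Can content features alone distinguish articles that will be retweeted (non-zero retweets) from those that will not?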

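Key Findings

Using only pre-publication content features, the proposed method reaches 84% overall classification accuracy when predicting popularity ranges (low, medium, or high retweet counts).
The source of a news article is the most important predictor and has a significant effect on whether the article goes viral on Twitter.
Articles from technology blogs such as Mashable and Google Blog were among the most widely shared, even though these are not traditional mainstream news outlets.
Subjectivity and named entities did not significantly improve predictive performance, suggesting that readers do not favor more subjective or more entity-rich articles.
The category feature did not significantly help predict how popular an article becomes on Twitter, but it did help predict whether an article gets shared at all, possibly because the platform itself leans toward technology content.
The binary classification task (zero versus non-zero retweets) reached 66% accuracy, with source and category being the most informative features for that prediction.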
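
Predicting a range rather than an exact count can be made concrete with a small sketch. The boundaries below (`LOW_MAX`, `MEDIUM_MAX`) are hypothetical, since this review does not state the cut-offs used in the paper. The function names and the sample counts are likewise illustrative.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds for the "low" and "medium" ranges.
LOW_MAX, MEDIUM_MAX = 20, 100
RANGE_NAMES = ("low", "medium", "high")


def popularity_range(retweets, bounds=(LOW_MAX, MEDIUM_MAX)):
    """Map a retweet count to 'low', 'medium', or 'high'."""
    if retweets < 0:
        raise ValueError("retweet count cannot be negative")
    # bisect_left counts how many bounds lie strictly below the value.
    return RANGE_NAMES[bisect_left(bounds, retweets)]


def was_retweeted(retweets):
    """Target for the binary task: zero versus non-zero retweets."""
    return retweets > 0


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up counts: what a model guessed versus what was observed.
predicted_counts = [5, 60, 300, 15, 150]
observed_counts = [12, 95, 180, 40, 130]

predicted_ranges = [popularity_range(c) for c in predicted_counts]
observed_ranges = [popularity_range(c) for c in observed_counts]
range_accuracy = accuracy(predicted_ranges, observed_ranges)
print(range_accuracy)
```

In this toy data none of the predicted counts equals the observed count, yet four of the five fall in the same range, so the range accuracy is 0.8. This shows why classifying into ranges is a more forgiving target than regressing on exact retweet counts.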


This review was generated by AI and reviewed by a human editor.