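[Paper Summary] Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media
An unsupervised framework that uses social media as a crowdsourced sensor to detect cyber-attacks (DDoS, data breaches, account hijacking), dynamically expanding seed queries through dependency-tree-based templates and word embeddings, and evaluated on large-scale Twitter data.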
Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in an unsupervised manner using just a limited fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods.
Research Motivation and Goals
- Motivate the use of open social media signals as a sensor for cyber-attacks, with the aim of reducing detection latency.
- Develop an unsupervised framework that maps limited seed triggers to expanded queries to detect events.
- Model the reporting structure of cyber-attacks in social media via dependency parses and word embeddings.
- Evaluate the approach on large-scale Twitter data across three attack categories (DDoS, data breach, account hijacking).
Proposed Method
- Introduce Target Domain Generation to collect tweets that are syntactically and semantically similar to the seed queries, using a convolution tree kernel over dependency trees.
- Propose Dynamic Typed Query Expansion, which iteratively expands the seed queries by selecting candidate expansions via KL divergence, so that the expansions distinguish the target domain from the global tweet collection.
- Represent events as (Q_e, date, type) where Q_e is a set of expanded queries tied to a cyber-attack type.
- Cluster the expanded queries into exemplars and assign each exemplar to an attack type based on its similarity to the initial seeds.
- Evaluate using a large GNIP Twitter dataset (Aug 2014–Oct 2016) with gold-standard reports from Hackmageddon and PrivacyRights.
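The KL-divergence selection step can be illustrated with a small sketch. This is not the authors' implementation: it simplifies typed dependency queries to plain unigrams, and the function names, the additive smoothing constant `alpha`, and the toy tweets are all assumptions made for illustration. Each term is scored by its contribution p(t) * log(p(t) / q(t)) to the KL divergence between the target-domain distribution p and the global distribution q, and the highest-scoring non-seed terms are proposed as expansions.

```python
import math
from collections import Counter


def term_distribution(tweets, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_contributions(target_tweets, global_tweets):
    """Per-term contribution to KL(target || global)."""
    vocab = {
        tok
        for tweet in target_tweets + global_tweets
        for tok in tweet.lower().split()
    }
    p = term_distribution(target_tweets, vocab)
    q = term_distribution(global_tweets, vocab)
    return {t: p[t] * math.log(p[t] / q[t]) for t in vocab}


def top_expansions(target_tweets, global_tweets, seeds, k=3):
    """Return the k non-seed terms that best separate target from global."""
    scores = kl_contributions(target_tweets, global_tweets)
    candidates = [(t, s) for t, s in scores.items() if t not in seeds]
    # Sort by descending score; break ties alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [t for t, _ in candidates[:k]]


# Hypothetical toy data: the target domain is a subset of the global collection.
target_tweets = [
    "acme servers hit by ddos attack",
    "ddos attack takes acme offline",
    "massive ddos attack on acme",
]
global_tweets = target_tweets + [
    "nice weather on the beach today",
    "the game was great today",
    "coffee on the way to work",
]

print(top_expansions(target_tweets, global_tweets, seeds={"ddos", "attack"}, k=1))
```

Terms that are frequent everywhere (such as "the" in the toy data) receive a negative score and are never selected, which is the property that lets the expansion stay focused on the target domain. An iterative version would add the selected terms to the seed set, regenerate the target domain, and repeat.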
Experimental Results
Research Questions
- RQ1: Can a small set of seed typed dependency queries be expanded dynamically to cover a broad range of cyber-attack reports in social media?
- RQ2: Does a convolution-tree kernel plus word embedding-based similarity improve target domain generation over naive keyword methods?
- RQ3: How well can unsupervised, seed-driven query expansion detect and characterize data breaches, account hijackings, and DDoS events in Twitter?
- RQ4: What are the precision/recall trade-offs of the proposed method compared to a traditional burst-detection baseline?
- RQ5: Can detected events be matched to established ground-truth cyber-attack datasets to validate performance?
Key Findings
- The method achieves around 0.78 precision and 0.74 recall for data breaches and 0.80 precision with 0.45 recall for DDoS events, with account hijacking at 0.66 precision and 0.56 recall.
- Recall is higher for data breaches (approximately 0.75) than for DDoS or account hijacking, because the latter two attack types have shorter signal lifecycles.
- The baseline, Kleinberg burst detection on fixed keywords, aligns less well with the ground truth than the typed dynamic query expansion approach.
- The approach detects additional events not listed in gold-standard sources, indicating discovery of new cyber-attack reports from social media.
- Case studies demonstrate detection of high-profile incidents (e.g., Ashley Madison data breach, Sony/Dyn DDoS, CentCom account hijacking) with interpretable expanded queries.
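To make the precision and recall figures above concrete, the following sketch shows one plausible way to score detected events against a gold-standard list. The matching rule is an assumption, not taken from the paper: a detected event counts as correct if a not-yet-matched gold event has the same attack type and a date within `window_days`. The function name, the window size, and all dates in the toy data are hypothetical.

```python
from datetime import date


def match_events(detected, gold, window_days=3):
    """Greedy one-to-one matching of (date, type) events.

    Returns (precision, recall). The date-window criterion is an
    illustrative assumption, not the paper's evaluation protocol.
    """
    unmatched_gold = list(gold)
    true_pos = 0
    for d_date, d_type in detected:
        for g in unmatched_gold:
            g_date, g_type = g
            if d_type == g_type and abs((d_date - g_date).days) <= window_days:
                # Each gold event may be matched at most once.
                unmatched_gold.remove(g)
                true_pos += 1
                break
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy events; none of these dates come from the paper.
detected = [
    (date(2016, 1, 10), "data_breach"),
    (date(2016, 2, 5), "ddos"),
    (date(2016, 3, 1), "account_hijacking"),
    (date(2016, 4, 20), "ddos"),  # no gold counterpart: false positive
]
gold = [
    (date(2016, 1, 9), "data_breach"),
    (date(2016, 2, 5), "ddos"),
    (date(2016, 3, 2), "account_hijacking"),
    (date(2016, 5, 15), "data_breach"),  # missed by the detector
    (date(2016, 6, 6), "ddos"),          # missed by the detector
]

precision, recall = match_events(detected, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this scoring, detected events that have no gold counterpart lower precision even when they are genuine attacks missing from the gold-standard sources, which is worth keeping in mind when reading the reported precision values.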
This summary was generated by AI and reviewed by a human editor.