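[Paper Summary] Fast & Furious: Modelling Malware Detection as Evolving Data Streams
This paper proposes a data stream learning pipeline for Android malware detection that responds to both concept drift and feature drift by adapting the classifier and the feature extractor together. The authors train on 415K Android apps collected from DREBIN and AndroZoo between 2009 and 2018, using Word2Vec and TF-IDF features. Updating both components at drift points raises the F1-score by 22.05 percentage points on DREBIN and by 8.77 percentage points on AndroZoo, outperforming both static models and models that update only the classifier.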
Malware is a major threat to computer systems and poses many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drifts) that directly affect ML model detection rates, something not considered in most of the literature. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (about 130K apps) and a subset of AndroZoo (about 285K apps). We used these datasets to train an Adaptive Random Forest (ARF) classifier, as well as a Stochastic Gradient Descent (SGD) classifier. We also ordered all dataset samples using their VirusTotal submission timestamp and then extracted features from their textual attributes using two algorithms (Word2Vec and TF-IDF). Then, we conducted experiments comparing both feature extractors, classifiers, as well as four drift detectors (DDM, EDDM, ADWIN, and KSWIN) to determine the best approach for real environments. Finally, we compare some possible approaches to mitigate concept drift and propose a novel data stream pipeline that updates both the classifier and the feature extractor. To do so, we conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest to its pervasiveness, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed literature approaches.
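Research Motivation and Goals
- Investigate whether concept drift is a pervasive phenomenon across Android malware datasets rather than something confined to individual cases.
- Assess whether both the feature extractor and the classifier must be updated when concept drift occurs in order to sustain high detection accuracy over the long term.
- Identify the best combination of feature extraction, classification, and drift detection techniques for long-term malware detection.
- Propose and validate a real-time, adaptive malware detection pipeline that mitigates the performance degradation caused by malware evolution.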
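Proposed Method
- The authors collected 415K Android apps from DREBIN and AndroZoo and ordered them chronologically by VirusTotal submission timestamp to simulate a real-world data stream.
- Textual attributes (such as permissions and API calls) are turned into features using two representations: TF-IDF and Word2Vec.
- Two classifiers, Adaptive Random Forest (ARF) and Stochastic Gradient Descent (SGD), are trained and then updated in response to detected drift.
- Four drift detectors (DDM, EDDM, ADWIN, and KSWIN) are evaluated for identifying concept drift in the data stream.
- When drift is detected, the proposed pipeline updates both the classifier and the feature extractor, so that the representation and the predictions adapt together.
- The system is implemented as an extension of scikit-multiflow, which supports reproducibility and makes it extensible for future research.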
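The joint update described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation and does not use the scikit-multiflow API: `VocabExtractor` and `TokenVoteClassifier` are hypothetical toy stand-ins for TF-IDF/Word2Vec and ARF/SGD, `toy_stream` is synthetic data invented for the illustration, and DDM is used only because it is the simplest of the four detectors to write down. The `update_extractor` flag switches between updating the classifier alone and updating both components at a drift point.

```python
import math
from collections import Counter, deque


class DDM:
    """Drift Detection Method: signals drift when the running error rate
    rises well above the lowest level observed so far."""

    def __init__(self, min_samples=30, drift_level=3.0):
        self.min_samples, self.drift_level = min_samples, drift_level
        self.reset()

    def reset(self):
        self.n, self.p = 0, 0.0
        self.p_min, self.s_min = math.inf, math.inf

    def update(self, error):
        """Feed 1 for a misclassification, 0 otherwise; return True on drift."""
        self.n += 1
        self.p += (error - self.p) / self.n            # running error rate
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return False
        if self.p + s <= self.p_min + self.s_min:      # new best operating point
            self.p_min, self.s_min = self.p, s
        return self.p + s > self.p_min + self.drift_level * self.s_min


class VocabExtractor:
    """Toy feature extractor (assumption): keeps only tokens seen at fit time."""

    def __init__(self, docs):
        self.vocab = {tok for doc in docs for tok in doc}

    def transform(self, doc):
        return [tok for tok in doc if tok in self.vocab]


class TokenVoteClassifier:
    """Toy incremental classifier (assumption): tokens vote by class counts."""

    def __init__(self):
        self.counts = {0: Counter(), 1: Counter()}

    def learn(self, feats, label):
        self.counts[label].update(feats)

    def predict(self, feats):
        score = sum(self.counts[1][t] - self.counts[0][t] for t in feats)
        return 1 if score > 0 else 0


def train(extractor, samples):
    clf = TokenVoteClassifier()
    for doc, label in samples:
        clf.learn(extractor.transform(doc), label)
    return clf


def run_stream(stream, warmup=50, update_extractor=True):
    """Test-then-train loop; returns the number of misclassifications."""
    stream = list(stream)
    buffer = deque(stream[:warmup], maxlen=warmup)     # most recent samples
    extractor = VocabExtractor(doc for doc, _ in buffer)
    clf, detector, errors = train(extractor, buffer), DDM(), 0
    for doc, label in stream[warmup:]:
        feats = extractor.transform(doc)
        error = int(clf.predict(feats) != label)
        errors += error
        clf.learn(feats, label)
        buffer.append((doc, label))
        if detector.update(error):                     # drift point
            if update_extractor:                       # refit the representation
                extractor = VocabExtractor(doc for doc, _ in buffer)
            clf = train(extractor, buffer)             # rebuild the classifier
            detector.reset()
    return errors


def toy_stream(n=400, drift_at=200):
    """Synthetic stream: malware switches to a previously unseen token."""
    for t in range(n):
        if t % 2:
            yield ["INTERNET", "CAMERA"], 0            # benign
        else:
            new = "BIND_ACCESSIBILITY_SERVICE" if t >= drift_at else "SEND_SMS"
            yield ["INTERNET", new], 1                 # malware


print("errors, classifier only:", run_stream(toy_stream(), update_extractor=False))
print("errors, joint update:   ", run_stream(toy_stream(), update_extractor=True))
```

On this toy stream the classifier-only variant keeps misclassifying the new malware because the frozen vocabulary discards the token that distinguishes it, while the joint variant recovers shortly after the drift point. The sketch shows the mechanism only; it says nothing about the magnitudes reported in the paper.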
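Experimental Results
Research Questions
- RQ1: Is concept drift a pervasive phenomenon across diverse Android malware datasets, or is it limited to particular data distributions?
- RQ2: When concept drift occurs, must the feature extractor and the classifier both be updated to sustain high detection accuracy over the long term?
- RQ3: Which combination of feature extractor, classifier, and drift detector achieves the best long-term performance in malware detection?
- RQ4: How does the timing of model updates (drift-triggered vs. fixed window) affect detection performance?
- RQ5: How strongly is malware evolution correlated with changes in the Android ecosystem?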
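Key Findings
- Resetting the classifier only after a concept drift is detected outperforms periodic resets based on fixed time windows.
- Updating both the classifier and the feature extractor at drift points yields the highest detection performance, raising the F1-score by 22.05 percentage points on the DREBIN dataset.
- On the larger AndroZoo dataset, joint adaptation also brings a substantial gain, with the F1-score improving by 8.77 percentage points.
- The KSWIN drift detector outperforms DDM, EDDM, and ADWIN on both datasets, detecting concept drift more effectively.
- The study confirms that malware evolution causes both concept drift and feature drift: new API calls, permissions, and vocabulary emerge over time.
- The proposed pipeline is released as an extension of scikit-multiflow, which supports community adoption and further research on adaptive malware detection.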
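One simple way to make the feature drift finding concrete is to measure how many tokens in later apps were never seen when the feature extractor was fitted. The sketch below is an illustration under stated assumptions, not a metric taken from the paper: `out_of_vocabulary_rate` is a hypothetical helper, and the two tiny app lists (with their year labels) are made-up examples.

```python
def out_of_vocabulary_rate(reference_docs, later_docs):
    """Fraction of tokens in later_docs that never occur in reference_docs."""
    vocab = {tok for doc in reference_docs for tok in doc}
    tokens = [tok for doc in later_docs for tok in doc]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Made-up permission lists, for illustration only.
apps_2012 = [["INTERNET", "SEND_SMS"], ["INTERNET", "READ_CONTACTS"]]
apps_2017 = [["INTERNET", "BIND_ACCESSIBILITY_SERVICE"],
             ["SYSTEM_ALERT_WINDOW", "INTERNET"]]

print(out_of_vocabulary_rate(apps_2012, apps_2012))  # 0.0
print(out_of_vocabulary_rate(apps_2012, apps_2017))  # 0.5
```

A rising rate over successive time windows would suggest that a feature extractor fitted on old data is silently discarding information, which is the situation in which refitting the extractor, and not only the classifier, becomes relevant.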
This summary was generated by AI and reviewed by a human editor.