QUICK REVIEW

[Paper Review] Using Lexical Features for Malicious URL Detection -- A Machine Learning Approach

Apoorva Joshi, Levi Lloyd|arXiv (Cornell University)|Oct 14, 2019

Spam and Phishing Detection | Cited by 24

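One-Sentence Summary

The paper proposes a machine-learning ensemble model that detects malicious URLs with high sensitivity using static lexical features extracted directly from the URL string. Across five test sets it reports an average false negative rate of 0.1%, an average accuracy of 92%, and an average AUC of 0.98. Deployed in a real-time email security workflow, it raised malicious URL detections by 22% while avoiding the latency of crawling and blacklist lookups.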

ABSTRACT

Malicious websites are responsible for a majority of the cyber-attacks and scams today. Malicious URLs are delivered to unsuspecting users via email, text messages, pop-ups or advertisements. Clicking on or crawling such URLs can result in compromised email accounts, launching of phishing campaigns, download of malware, spyware and ransomware, as well as severe monetary losses. A machine learning based ensemble classification approach is proposed to detect malicious URLs in emails, which can be extended to other methods of delivery of malicious URLs. The approach uses static lexical features extracted from the URL string, with the assumption that these features are notably different for malicious and benign URLs. The use of such static features is safer and faster since it does not involve crawling the URLs or blacklist lookups which tend to introduce a significant amount of latency in producing verdicts. The goal of the classification was to achieve high sensitivity i.e. detect as many malicious URLs as possible. URL strings tend to be very unstructured and noisy. Hence, bagging algorithms were found to be a good fit for the task since they average out multiple learners trained on different parts of the training data, thus reducing variance. The classification model was tested on five different testing sets and produced an average False Negative Rate (FNR) of 0.1%, average accuracy of 92% and average AUC of 0.98. The model is presently being used in the FireEye Advanced URL Detection Engine (used to detect malicious URLs in emails), to generate fast real-time verdicts on URLs. The malicious URL detections from the engine have gone up by 22% since the deployment of the model into the engine workflow. The results obtained show noteworthy evidence that a purely lexical approach can be used to detect malicious URLs.

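Research Motivation and Objectives

Address the growing threat that malicious URLs pose in cyber-attacks, particularly through phishing and malware distribution.
Reduce the latency of malicious URL detection by avoiding dynamic analysis and blacklist lookups.
Increase detection sensitivity so that as few malicious URLs as possible are missed in email security workflows.
Show that lexical features alone can effectively distinguish malicious URLs from benign ones.
Develop a scalable solution that can be deployed in real time in production security systems such as FireEye's URL detection engine.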

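Proposed Method

Extract static lexical features from the URL string, for example length, special characters, digit frequency, and subdomain patterns.
Apply a bagging-based ensemble learning method (such as a random forest or a similar algorithm) to reduce variance and improve robustness on noisy, unstructured URLs.
Train the model on diverse URL datasets so that it generalizes across different malicious URL patterns.
Use threshold-based classification that prioritizes a low false negative rate, in line with the goal of high sensitivity.
Use feature engineering to capture the linguistic and syntactic anomalies common in malicious URLs.
Deploy the model in a real-time processing pipeline for email-based URL analysis, avoiding resource-intensive crawling and external lookups.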
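The steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the five features, the SPECIAL_CHARS set, the decision-stump base learner, the 0.3 voting threshold, and the toy URLs (all on reserved example domains) are hypothetical choices made here, since the summary does not specify them.

```python
import random
from urllib.parse import urlsplit

SPECIAL_CHARS = "-_@?=&%."  # assumed character set, not taken from the paper


def lexical_features(url):
    """Static lexical features computed from the URL string alone (no crawling)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return [
        len(url),                                     # total URL length
        sum(ch.isdigit() for ch in url),              # number of digits
        sum(url.count(ch) for ch in SPECIAL_CHARS),   # special-character count
        host.count("."),                              # rough proxy for subdomain depth
        len(parts.path),                              # path length
    ]


def fit_stump(X, y):
    """Choose the (feature, threshold) pair with the fewest training errors.
    The stump votes malicious (1) when the feature value exceeds the threshold."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            errors = sum((row[j] > t) != bool(label) for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def malicious_score(stumps, x):
    """Fraction of stumps that vote malicious for feature vector x."""
    return sum(x[j] > t for j, t in stumps) / len(stumps)


def classify(stumps, x, threshold=0.3):
    """A threshold below 0.5 trades extra false positives for fewer misses."""
    return int(malicious_score(stumps, x) >= threshold)


# Toy training data on reserved domains; labels: 0 = benign, 1 = malicious.
BENIGN = [
    "https://example.com/",
    "https://example.org/docs",
    "http://example.net/about",
    "https://example.com/blog/post",
]
MALICIOUS = [
    "http://secure-login.account.verify.example.test/update.php?id=48213&session=7731",
    "http://192.0.2.10/paypa1/confirm/index.php?user=1001&token=ab12cd34ef56",
    "http://free-gift.cards.win-prize.example.test/claim?code=99821&ref=mail_2019",
    "http://update.billing.support-desk.example.test/a/b/login.html?acct=5550123",
]
X = [lexical_features(u) for u in BENIGN + MALICIOUS]
y = [0] * len(BENIGN) + [1] * len(MALICIOUS)
model = fit_bagging(X, y)
```

Calling classify(model, lexical_features(url)) then yields a verdict from the string alone, which is what makes this kind of approach suitable for low-latency use. A production system would likely use a tested library implementation of bagging rather than hand-written stumps.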

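Experimental Results

Research Questions

RQ1: Can lexical features extracted only from the URL string effectively distinguish malicious URLs from benign ones?
RQ2: How does a bagging-based ensemble model perform at detecting malicious URLs compared with other machine learning methods?
RQ3: How much sensitivity can a static lexical approach achieve without relying on URL crawling or external blacklists?
RQ4: What effect does such a model have on real-world detection performance in a production environment?
RQ5: How well does the model's performance generalize across diverse real-world URL datasets?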

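Key Findings

The model achieved an average false negative rate (FNR) of 0.1%, meaning it missed very few of the malicious URLs in the test sets.
The model reached an average accuracy of 92% across five different test sets, which suggests it generalizes well.
The area under the receiver operating characteristic curve (AUC) averaged 0.98, indicating strong discrimination between malicious and benign URLs.
After the model was deployed in the FireEye Advanced URL Detection Engine, the engine's malicious URL detections rose by 22%.
The approach produces fast, real-time verdicts without dynamic analysis or external lookups.
The results provide noteworthy evidence that lexical features alone can serve as a reliable basis for malicious URL detection.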
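For readers who want to see how the three reported metrics relate, the following is a minimal sketch that computes FNR, accuracy, and AUC from labels and scores. The labels, scores, and thresholds are made-up illustrative values, not data from the paper. AUC is computed here as the fraction of (malicious, benign) pairs that are ranked correctly, with ties counted as half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn), treating label 1 as malicious."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn


def false_negative_rate(y_true, y_pred):
    """Share of malicious URLs that were labeled benign (1 - sensitivity)."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return fn / (tp + fn)


def accuracy(y_true, y_pred):
    """Share of all URLs that were labeled correctly."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + fp + tn)


def auc(y_true, scores):
    """Probability that a random malicious URL outscores a random benign one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(scores, threshold):
    """Label a URL malicious when its score reaches the threshold."""
    return [int(s >= threshold) for s in scores]


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

strict = predict(scores, 0.5)     # misses the malicious URL scored 0.4
sensitive = predict(scores, 0.3)  # catches it, at the cost of one false positive
```

On this toy data, lowering the threshold from 0.5 to 0.3 removes the one false negative and leaves one false positive, while AUC is unchanged because it does not depend on the threshold. The same trade-off is one plausible reading of how a very low FNR can coexist with 92% accuracy.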

This review was generated by AI and checked by a human editor.