QUICK REVIEW

[Paper Review] URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection

H Le, Quang Pham|arXiv (Cornell University)|Feb 9, 2018

Spam and Phishing Detection|References: 34|Cited by: 193

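One-Sentence Summary

URLNet uses two CNN branches, one over characters and one over words (with advanced word embeddings), to learn a robust URL representation for malicious URL detection. It outperforms traditional lexical-feature baselines and handles unseen and rare words effectively.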

ABSTRACT

Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding for Malicious URL Detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both characters and words of the URL String to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible by the existing models. We also propose advanced word-embeddings to solve the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of various components of URLNet.

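Motivation and Goals

Move robust malicious URL detection beyond blacklists and hand-engineered lexical features.
Propose an end-to-end deep learning model that learns a URL embedding directly from the raw URL string.
Capture the semantic and sequential patterns of URLs with character-level and word-level CNNs, supported by advanced word embeddings.
Address rare and unseen words, as well as memory constraints, in large-scale URL datasets.
Evaluate URLNet against strong lexical-feature baselines and run ablation studies to understand the contribution of each component.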

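Proposed Method

Introduce URLNet with two CNN branches: a character-level and a word-level URL representation.
Use multiple filter sizes in both branches (h in {3, 4, 5, 6}), with 256 filters per size.
For words, use an advanced embedding that combines word-level and character-level information in order to handle rare and unseen words.
Include special characters as words to capture additional sequential information.
Train end to end with dropout and the Adam optimizer, concatenating the branch outputs before the final dense layers.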
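
The two input views described above (characters, and words with special characters kept as tokens) can be illustrated with a minimal sketch. This is not the authors' code. The splitting rule (alphanumeric runs are words, every other character is its own token), the `min_count` threshold, and the `<UNK>` placeholder name are assumptions made for illustration. The sketch only prepares inputs; the CNN branches and embeddings are out of scope here.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of letters/digits, and every other
# character becomes a one-character special token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")
UNK = "<UNK>"  # hypothetical placeholder for rare or unseen words


def char_tokens(url):
    """Character-level view: the URL as a list of single characters."""
    return list(url)


def word_tokens(url):
    """Word-level view: alphanumeric runs plus special characters as words."""
    return TOKEN_PATTERN.findall(url)


def build_vocab(urls, min_count=2):
    """Keep words seen at least min_count times in the training URLs."""
    counts = Counter(tok for url in urls for tok in word_tokens(url))
    return {tok for tok, n in counts.items() if n >= min_count}


def encode_words(url, vocab):
    """Map out-of-vocabulary words to UNK, but keep each word's characters
    so a character-based word embedding could still describe it."""
    return [(tok if tok in vocab else UNK, list(tok)) for tok in word_tokens(url)]
```

Keeping the character list next to each word token is what would let a character-based embedding say something about a word that the word vocabulary maps to `<UNK>`.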

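Experimental Results

Research Questions

RQ1: Can URLNet outperform traditional bag-of-words lexical features and hand-engineered baselines in malicious URL detection?
RQ2: How do the character-level, word-level, and full URLNet (URLNet Full) variants compare, and what does the combined model contribute?
RQ3: Does the model generalize to unseen and rare words through character-based word embeddings and its handling of special characters?
RQ4: How does training data size affect URLNet's performance, and how do the different feature structures contribute to robustness?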

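Key Findings

URLNet variants significantly outperform the baseline lexical-feature models on AUC and TPR@FPR.
URLNet Full, which combines the character-level and word-level CNNs, yields the strongest and most consistent performance gains.
The character-level and word-level CNNs offer complementary strengths, and the Full model uses both to improve detection across different FPR levels.
Advanced word embeddings that incorporate character information help with memory constraints and make it possible to handle unseen words.
Increasing the training data from 1 million to 5 million URLs improves performance on every metric.
Character-level CNNs excel at recognizing patterns in long sequences, while word-level CNNs capture token-level semantics; combining the two works better than using either model alone.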
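
For readers unfamiliar with the TPR@FPR metric mentioned in the findings, here is a minimal sketch of how it could be computed from classifier scores. The function name `tpr_at_fpr` and the tie-handling convention are assumptions for illustration, not the paper's evaluation code. The idea is to report the best true positive rate achievable while the false positive rate stays at or below a fixed budget.

```python
def tpr_at_fpr(labels, scores, target_fpr):
    """Highest TPR achievable with FPR <= target_fpr.

    labels: 1 = malicious, 0 = benign.
    scores: higher means "more likely malicious".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Consume every example tied at this score before evaluating.
        score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / negatives > target_fpr:
            break  # FPR only grows as the threshold drops further
        best_tpr = tp / positives
    return best_tpr
```

In malicious URL detection the FPR budget is usually kept very small, because each false positive blocks a legitimate site. That makes this metric more informative than AUC alone at the operating points that matter.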

This review was generated by AI and checked by a human editor.