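[Paper Summary] eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys
eXpose combines a character-level CNN with learnable embeddings to detect malicious URLs, file paths, and registry keys directly from raw strings, outperforming hand-crafted feature baselines at low false positive rates.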
For years security machine learning research has promised to obviate the need for signature based detection by automatically learning to detect indicators of attack. Unfortunately, this vision hasn't come to fruition: in fact, developing and maintaining today's security machine learning systems can require engineering resources that are comparable to that of signature-based detection systems, due in part to the need to develop and continuously tune the "features" these machine learning systems look at as attacks evolve. Deep learning, a subfield of machine learning, promises to change this by operating on raw input signals and automating the process of feature design and extraction. In this paper we propose the eXpose neural network, which uses a deep learning approach we have developed to take generic, raw short character strings as input (a common case for security inputs, which include artifacts like potentially malicious URLs, file paths, named pipes, named mutexes, and registry keys), and learns to simultaneously extract features and classify using character-level embeddings and convolutional neural network. In addition to completely automating the feature design and extraction process, eXpose outperforms manual feature extraction based baselines on all of the intrusion detection problems we tested it on, yielding a 5%-10% detection rate gain at 0.1% false positive rate compared to these baselines.
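Research Motivation and Goals
- Eliminate manual feature engineering in security detection by applying end-to-end deep learning to raw string inputs.
- Show that a single model can detect several artifact types (URLs, file paths, registry keys).
- Show that embedding + CNN features outperform traditional n-gram and expert-feature baselines on large security datasets.
- Measure performance at deployment-relevant false positive rates to assess practical utility.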
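Proposed Method
- Embed the raw character string into a dense vector space using trainable embeddings.
- Apply one-dimensional convolutions with multiple kernel sizes to the embedded sequence to detect local patterns.
- Aggregate the convolutional activations with sum pooling to produce fixed-length features.
- Combine the features in a dense classifier that uses dropout and batch normalization to reduce overfitting.
- Train end-to-end with Adam, using batches balanced between benign and malicious samples to address class imbalance.
- Compare against n-gram and expert-feature baselines using ROC/AUC metrics.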
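The feature-extraction part of this pipeline (embed characters, convolve with several kernel widths, sum-pool into a fixed-length vector) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the vocabulary size, embedding width, kernel widths, filter count, ReLU activation, left-padding, and the random untrained weights are all illustrative choices, and the dense classifier, dropout, batch normalization, and Adam training are omitted.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
VOCAB = 128              # ASCII code points; id 0 doubles as padding
EMBED_DIM = 8            # width of each character embedding
MAX_LEN = 32             # strings are truncated or padded to this length
KERNEL_WIDTHS = (2, 3, 4, 5)
FILTERS_PER_WIDTH = 4


def encode(text, max_len=MAX_LEN):
    """Map characters to integer ids, truncating or left-padding with 0."""
    ids = [ord(c) if ord(c) < VOCAB else VOCAB - 1 for c in text.lower()[:max_len]]
    return [0] * (max_len - len(ids)) + ids


def init_params(seed=0):
    """Create random (untrained) embedding and convolution weights."""
    rng = random.Random(seed)
    embed = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB)]
    kernels = {
        w: [[[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
             for _ in range(w)]
            for _ in range(FILTERS_PER_WIDTH)]
        for w in KERNEL_WIDTHS
    }
    return embed, kernels


def conv1d_sum_pool(seq, filt):
    """Slide one filter over the sequence, apply ReLU, and sum over positions."""
    width = len(filt)
    total = 0.0
    for start in range(len(seq) - width + 1):
        act = sum(seq[start + i][d] * filt[i][d]
                  for i in range(width) for d in range(EMBED_DIM))
        total += max(act, 0.0)  # ReLU is an assumed activation
    return total


def extract_features(text, embed, kernels):
    """Return one sum-pooled value per filter: a fixed-length feature vector."""
    seq = [embed[i] for i in encode(text)]
    return [conv1d_sum_pool(seq, f) for w in KERNEL_WIDTHS for f in kernels[w]]


if __name__ == "__main__":
    embed, kernels = init_params()
    feats = extract_features("http://example.com/login.php", embed, kernels)
    print(len(feats))  # one value per filter across all kernel widths
```

Because pooling sums over positions, the output length depends only on the number of filters, so a URL, a file path, and a registry key all map to vectors of the same size and can share one downstream classifier.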
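Experimental Results
Research Questions
- RQ1: Can an end-to-end character-level CNN with embeddings learn useful representations from raw strings for detecting malicious security artifacts?
- RQ2: Do embedding + convolutional features outperform hand-crafted feature extraction baselines on URLs, file paths, and registry keys?
- RQ3: Is the approach robust to obfuscation and mutation of the short strings that are common in security artifacts?
Key Findings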
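For context on the kind of hand-crafted baseline RQ2 refers to, here is a small sketch of character n-gram features. This summary does not describe how the n-gram baseline was built, so the construction below (hashing character n-grams into a fixed number of count bins) is an assumption chosen because it is a common approach; the function name, n-gram orders, and bin count are hypothetical.

```python
import zlib


def ngram_hash_features(text, n_values=(1, 2, 3), num_bins=64):
    """Count character n-grams, hashed into a fixed-length vector.

    zlib.crc32 is used instead of the built-in hash() so that bin
    assignments are stable across Python processes.
    """
    vec = [0] * num_bins
    for n in n_values:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            vec[zlib.crc32(gram.encode("utf-8")) % num_bins] += 1
    return vec


if __name__ == "__main__":
    features = ngram_hash_features("HKLM\\Software\\Run")
    print(len(features), sum(features))
```

Unlike the learned embeddings, these features are fixed in advance: the n-gram orders and bin count are chosen by hand, and unrelated n-grams can collide in the same bin.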
| Task | Model | TPR @ FPR 1e-4 | TPR @ FPR 1e-3 | TPR @ FPR 1e-2 | AUC |
|---|---|---|---|---|---|
| URLs | Convnet | 0.77 | 0.84 | 0.92 | 0.993 |
| URLs | N-gram | 0.76 | 0.78 | 0.84 | 0.985 |
| URLs | Expert | 0.74 | 0.78 | 0.84 | 0.985 |
| File paths | Convnet | 0.16 | 0.43 | 0.68 | 0.978 |
| File paths | N-gram | 0.18 | 0.33 | 0.65 | 0.972 |
| Registry keys | Convnet | 0.51 | 0.62 | 0.86 | 0.992 |
| Registry keys | N-gram | 0.11 | 0.49 | 0.72 | 0.988 |
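- eXpose generalizes across three artifact types (URLs, file paths, registry keys) with a single architecture.
- At a false positive rate (FPR) of 1e-3, eXpose achieves a higher true positive rate (TPR) than the n-gram baseline on every task, and than the expert-feature baseline on URLs.
- For URLs, eXpose reaches 0.84 TPR at an FPR of 1e-3 and 0.993 AUC, versus 0.78 TPR and 0.985 AUC for the baselines.
- For file paths, eXpose reaches 0.43 TPR at an FPR of 1e-3 and 0.978 AUC, versus 0.33 TPR and 0.972 AUC for the n-gram baseline.
- For registry keys, eXpose reaches 0.62 TPR at an FPR of 1e-3 and 0.992 AUC, versus 0.49 TPR and 0.988 AUC for the n-gram baseline.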
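The "TPR at a fixed FPR" figures above can be read as: choose the most permissive score threshold whose false positive rate stays within the target, then report the fraction of malicious samples caught. The sketch below is one way to compute that from model scores and labels; the function name and the tie-handling rule (samples with equal scores are accepted or rejected together) are assumptions for illustration, not taken from the paper.

```python
def tpr_at_fpr(scores, labels, target_fpr):
    """Highest TPR reachable while keeping FPR <= target_fpr.

    scores: higher means more likely malicious.
    labels: 1 for malicious, 0 for benign.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one malicious and one benign sample")

    # Walk thresholds from strictest to most permissive.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        # Samples with identical scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / neg > target_fpr:
            break
        best = tp / pos
        i = j
    return best


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    labels = [1, 1, 0, 1, 0, 0]
    # With no false positives allowed, 2 of the 3 malicious samples are caught.
    print(tpr_at_fpr(scores, labels, target_fpr=0.0))
```

Since the measured FPR moves in steps of one over the number of benign samples, a target such as 1e-4 is only meaningful when the benign test set contains well over 10,000 samples.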
This summary was generated by AI and reviewed by a human editor.