QUICK REVIEW

[Paper Review] Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features

Joshua Saxe, Konstantin Berlin|arXiv (Cornell University)|Aug 13, 2015

Advanced Malware Detection Techniques | References: 24 | Cited by: 65

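One-Sentence Summary

This paper proposes a deep neural network (DNN) malware detection system built on two-dimensional binary program features, specifically byte entropy histograms computed over a 1024-byte sliding window, with no manual filtering or unpacking. On more than 400,000 real-world binaries, the method achieves a 95% detection rate at a 0.1% false positive rate, showing that high accuracy at a low false positive rate is attainable on commodity hardware. The system has also been deployed in a production enterprise environment.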

ABSTRACT

Malware remains a serious problem for corporations, government agencies, and individuals, as attackers continue to use it as a tool to effect frequent and costly network intrusions. Machine learning holds the promise of automating the work required to detect newly discovered malware families, and could potentially learn generalizations about malware and benign software that support the detection of entirely new, unknown malware families. Unfortunately, few proposed machine learning based malware detection methods have achieved the low false positive rates required to deliver deployable detectors. In this paper we introduce a deep neural network malware classifier that achieves a usable detection rate at an extremely low false positive rate and scales to real world training example volumes on commodity hardware. Specifically, we show that our system achieves a 95% detection rate at 0.1% false positive rate (FPR), based on more than 400,000 software binaries sourced directly from our customers and internal malware databases. We achieve these results by directly learning on all binaries, without any filtering, unpacking, or manually separating binary files into categories. Further, we confirm our false positive rates directly on a live stream of files coming in from Invincea's deployed endpoint solution, provide an estimate of how many new binary files we expected to see a day on an enterprise network, and describe how that relates to the false positive rate and translates into an intuitive threat score. Our results demonstrate that it is now feasible to quickly train and deploy a low resource, highly accurate machine learning classification model, with false positive rates that approach traditional labor intensive signature based methods, while also detecting previously unseen malware.
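
The abstract ties the false positive rate to the number of new binaries seen per day on an enterprise network. The arithmetic behind that link can be sketched as below. The daily volume is a hypothetical placeholder, not a figure from the paper, and the function name is illustrative.

```python
def expected_false_alarms_per_day(new_benign_binaries_per_day, false_positive_rate):
    """Expected number of benign files flagged per day at a given FPR."""
    return new_benign_binaries_per_day * false_positive_rate


# Hypothetical volume chosen only for illustration (not from the paper).
daily_new_binaries = 2000
fpr = 0.001  # the 0.1% false positive rate quoted in the abstract

alarms = expected_false_alarms_per_day(daily_new_binaries, fpr)
print(f"Expected false alarms per day: {alarms:.1f}")
```

Under this reading, the same false positive rate produces proportionally more alarms as the daily volume of new files grows. This is why the abstract treats a very low false positive rate as a requirement for a deployable detector.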

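Motivation and Objectives

Develop a scalable, low-resource malware detection system that achieves a high detection rate at an extremely low false positive rate.
Remove the need for manual preprocessing, such as unpacking or filtering binaries by packer type.
Make it possible to deploy machine learning models in production enterprise environments with live file streams.
Validate model performance on real-world data, including live traffic from customer endpoints.
Show that deep learning can match or exceed the accuracy of traditional signature-based methods while also detecting previously unseen malware.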

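Proposed Method

Extract a two-dimensional byte entropy histogram by sliding a 1024-byte window over the binary with a 256-byte step.
For each window, compute the base-2 entropy and pair it with every byte value in the window, then build a 16×16 histogram over entropy (0–8) and byte value (0–255).
Concatenate the histogram rows into a fixed-length feature vector (256 values for a 16×16 histogram) and feed it into a deep neural network with two hidden layers.
Train the deep neural network classifier on raw binaries that have not been sorted by packer, obfuscation technique, or any other binary characteristic.
Apply Bayesian calibration to convert the raw network output into an interpretable threat score that approximates the probability that a file is malware.
Use incremental training and compact model weights to support low-latency, real-time deployment and efficient inference on low-end hardware.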
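
The byte entropy histogram described in the first three steps can be sketched in plain Python as follows. This is a minimal illustration built from the window, step, and bin sizes stated above, not the authors' implementation. Three choices are assumptions of this sketch: a file shorter than one window is treated as a single window, trailing bytes that do not fill a window are dropped, and the histogram is normalized to sum to 1.

```python
import math
from collections import Counter
from typing import List

WINDOW = 1024  # window size in bytes (from the summary)
STEP = 256     # step size in bytes (from the summary)
BINS = 16      # bins per axis, giving a 16x16 histogram


def window_entropy(window: bytes) -> float:
    """Base-2 Shannon entropy of the byte distribution; lies in [0, 8]."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def byte_entropy_histogram(data: bytes) -> List[float]:
    """Return the flattened 16x16 (entropy bin, byte bin) histogram."""
    hist = [[0] * BINS for _ in range(BINS)]
    # Assumption: a file shorter than WINDOW is processed as one short window.
    last_start = max(len(data) - WINDOW, 0)
    for start in range(0, last_start + 1, STEP):
        window = data[start:start + WINDOW]
        if not window:
            break
        entropy = window_entropy(window)
        row = min(int(entropy / 8 * BINS), BINS - 1)  # entropy bin over [0, 8]
        for byte_value, count in Counter(window).items():
            col = byte_value // (256 // BINS)         # byte bin over [0, 255]
            hist[row][col] += count
    flat = [v for r in hist for v in r]               # concatenate the rows
    total = sum(flat)
    # Assumption: normalize so the 256 values sum to 1.
    return [v / total for v in flat] if total else [0.0] * (BINS * BINS)


if __name__ == "__main__":
    features = byte_entropy_histogram(bytes(range(256)) * 8)
    print(len(features), sum(features))
```

Every input maps to the same 256-value vector regardless of file size, so binaries of different lengths can share one fixed-width network input.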

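Experimental Results

Research Questions

RQ1: Can a deep neural network trained directly on raw binaries achieve high detection accuracy at a low false positive rate?
RQ2: Can two-dimensional binary features derived from sliding windows capture discriminative patterns for malware detection without manual feature engineering?
RQ3: How does the model perform on real-world, unlabeled file streams generated by enterprise endpoints?
RQ4: Can the system be trained at scale on commodity hardware while maintaining a low false positive rate?
RQ5: How well does the model generalize to previously unseen malware families without retraining?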
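
Several of these questions are framed as a detection rate at a fixed false positive rate. One common way to measure that quantity from classifier scores is sketched below. The function name and the toy scores are illustrative assumptions, and this is not necessarily the evaluation procedure used in the paper.

```python
def detection_rate_at_fpr(benign_scores, malware_scores, target_fpr):
    """Return (threshold, detection_rate) when flagging scores > threshold.

    The threshold is chosen so that at most target_fpr of the benign
    scores lie strictly above it.
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign files we may flag; the epsilon guards against
    # floating-point results such as 1.9999999 being floored to 1.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    detected = sum(score > threshold for score in malware_scores)
    return threshold, detected / len(malware_scores)


if __name__ == "__main__":
    # Toy scores for illustration only.
    benign = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    malware = [0.65, 0.9, 0.5, 0.95]
    print(detection_rate_at_fpr(benign, malware, target_fpr=0.25))
```

Measuring a 0.1% false positive rate with any precision needs a large benign sample, since only one benign file in a thousand may score above the threshold.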

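Key Findings

The system achieves a 95% detection rate at a 0.1% false positive rate on more than 400,000 real-world software binaries collected from customers and internal malware databases.
The false positive rate was verified directly on a live file stream from Invincea's deployed endpoint solution, confirming that it holds in a real environment.
The model was trained and deployed using only a single GPU, demonstrating that the approach scales on commodity hardware.
The method requires no manual preprocessing, such as unpacking or filtering by packer type, and trains directly on all binaries.
The system has been integrated into Invincea's cloud security analytics platform and is used for malware detection on thousands of customer endpoints.
The Bayesian-calibrated threat score gives an intuitive probabilistic reading of how likely a file is to be malware, which makes the output easier for operators to act on.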
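
The idea behind a Bayesian-calibrated threat score can be illustrated with Bayes' rule applied to the distributions of raw network scores. The sketch below estimates those distributions with simple smoothed histograms and assumes raw scores in [0, 1]. The density estimator, bin count, prior, and all function names are assumptions made for illustration, and the paper's actual calibration procedure may differ.

```python
def bin_index(score, n_bins):
    """Map a raw score in [0, 1] to a histogram bin."""
    return min(int(score * n_bins), n_bins - 1)


def fit_density(scores, n_bins=10, smoothing=1.0):
    """Smoothed histogram estimate of P(score bin | class)."""
    counts = [smoothing] * n_bins
    for s in scores:
        counts[bin_index(s, n_bins)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def threat_score(score, malware_density, benign_density, malware_prior):
    """P(malware | score) by Bayes' rule, given a prior malware ratio."""
    i = bin_index(score, len(malware_density))
    numerator = malware_density[i] * malware_prior
    denominator = numerator + benign_density[i] * (1.0 - malware_prior)
    return numerator / denominator


if __name__ == "__main__":
    # Toy validation scores, for illustration only.
    malware_density = fit_density([0.95, 0.9, 0.85], n_bins=2)
    benign_density = fit_density([0.05, 0.1, 0.15], n_bins=2)
    # The prior is a hypothetical deployment-time malware ratio.
    print(threat_score(0.9, malware_density, benign_density, malware_prior=0.1))
```

The same raw score yields a lower threat score when the assumed malware prior is lower. A calibrated score therefore depends on the expected mix of files in the deployment environment, not only on the classifier output.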

This review was generated by AI and reviewed by human editors.