
[Paper Review] NtMalDetect: A Machine Learning Approach to Malware Detection Using Native API System Calls

Chan Woo Kim|arXiv (Cornell University)|Feb 15, 2018
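Advanced Malware Detection Techniques | References: 4 | Cited by: 31
One-Sentence Summary

The paper presents NtMalDetect, a dynamic malware detection system that treats system call traces as text documents and applies natural language processing techniques, specifically TF-IDF-weighted n-grams with linear SVMs, to classify programs as benign or malicious. Using system call sequences, the system reaches 96% accuracy and 95% recall, with the SVM optimized by stochastic gradient descent performing best in both accuracy and efficiency.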

ABSTRACT

As computing systems become increasingly advanced and as users increasingly engage themselves in technology, security has never been a greater concern. In malware detection, static analysis, the method of analyzing potentially malicious files, has been the prominent approach. This approach, however, quickly falls short as malicious programs become more advanced and adopt the capabilities of obfuscating its binaries to execute the same malicious functions, making static analysis extremely difficult for newer variants. The approach assessed in this paper is a novel dynamic malware analysis method, which may generalize better than static analysis to newer variants. Inspired by recent successes in Natural Language Processing (NLP), widely used document classification techniques were assessed in detecting malware by doing such analysis on system calls, which contain useful information about the operation of a program as requests that the program makes of the kernel. Features considered are extracted from system call traces of benign and malicious programs, and the task to classify these traces is treated as a binary document classification task of system call traces. The system call traces were processed to remove the parameters to only leave the system call function names. The features were grouped into various n-grams and weighted with Term Frequency-Inverse Document Frequency. This paper shows that Linear Support Vector Machines (SVM) optimized by Stochastic Gradient Descent and the traditional Coordinate Descent on the Wolfe Dual form of the SVM are effective in this approach, achieving a highest of 96% accuracy with 95% recall score. Additional contributions include the identification of significant system call sequences that could be avenues for further research.

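Motivation and Objectives

  • Address the limitations of static malware analysis in detecting obfuscated or zero-day malware variants.
  • Evaluate how well document classification techniques from natural language processing detect malware when applied to system call traces.
  • Identify the most informative system call sequences for distinguishing malicious from benign behavior.
  • Develop a deployable open-source system (NtMalDetect) that integrates the trained classifiers for practical use.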

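Proposed Method

  • Extract system call traces from benign and malicious programs and strip the parameters, keeping only the function names.
  • Build n-gram features (1-grams to 10-grams) from the system call sequences and weight them with TF-IDF to emphasize rare but discriminative sequences.
  • Perform binary classification with several machine learning models, including linear SVMs optimized by SGD and by coordinate descent, k-NN, and Naive Bayes.
  • Extract the most informative features from the SVM classifiers under L1 and L2 regularization to identify key system call patterns.
  • Use a boosted ensemble classifier in the final NtMalDetect system to make detection more robust.
  • Implement the system with Scikit-learn and release it as an open-source project on GitHub.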
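
The first two steps above (parameter stripping, then TF-IDF-weighted n-grams) can be sketched in plain Python. This is a minimal illustration, not the NtMalDetect code: the `Name(args)` log format, the helper names `strip_parameters`, `make_ngrams`, and `tfidf_weights`, and the toy traces are all assumptions made for the example. It uses one common unsmoothed TF-IDF variant, tf × ln(N / df); libraries such as Scikit-learn apply a smoothed formula by default, so their weights will differ.

```python
import math
import re
from collections import Counter


def strip_parameters(trace_lines):
    """Keep only the leading function name of each logged call."""
    names = []
    for line in trace_lines:
        match = re.match(r"\s*([A-Za-z_]\w*)", line)
        if match:
            names.append(match.group(1))
    return names


def make_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max, space-joined."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_weights(documents):
    """Weight each n-gram by (relative term frequency) * ln(N / document frequency)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each n-gram once per document
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weighted.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return weighted


# Toy traces in a hypothetical "Name(args)" log format (not real captures).
trace_a = ["NtOpenFile(0x1c)", "NtReadFile(0x1c, 512)", "NtClose(0x1c)"]
trace_b = ["NtOpenFile(0x20)", "NtDelayExecution(0, 1000)", "NtClose(0x20)"]

docs = [make_ngrams(strip_parameters(t), 1, 2) for t in (trace_a, trace_b)]
weights = tfidf_weights(docs)
```

In this toy corpus, calls that occur in both traces (such as `NtOpenFile`) receive a weight of zero, while n-grams unique to one trace keep a positive weight, which is the "rare but discriminative" emphasis the method relies on.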

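Experimental Results

Research Questions

  • RQ1: Can system call traces be modeled effectively as text documents so that NLP techniques can be used for malware classification?
  • RQ2: Which machine learning algorithms perform best at classifying system call traces as benign or malicious?
  • RQ3: Which system call n-gram sequences are most discriminative for identifying malicious behavior?
  • RQ4: How do different regularization and optimization strategies affect classifier performance and efficiency?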

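Key Findings

  • The linear SVM optimized by stochastic gradient descent achieved the highest scores on the test set: 96% accuracy and 95% recall.
  • The SGD-optimized SVM with an L2 penalty was the fastest in both training and inference, with a test time below 0.001 seconds.
  • The most informative features identified by the L1-regularized SVM include repeated calls to NtQueryInformationThread and NtMapViewOfSection, suggesting an association with malware behavior patterns.
  • The L2-regularized SVM highlighted sequences involving NtDelayExecution and NtDeviceIoControlFile, suggesting they may be indicators of malicious behavior.
  • The system is reported to generalize well to unseen malware variants, an advantage over traditional static analysis when detecting obfuscated or zero-day threats.
  • The open-source NtMalDetect framework integrates the trained models into a deployable tool for real-time malware detection.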
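
To make the SGD-trained linear SVM and the feature-ranking findings more concrete, here is a minimal pure-Python sketch of a linear SVM with hinge loss and an L2 penalty, trained by per-sample subgradient steps, followed by a ranking of features by learned weight. It is not the paper's implementation (which, per the summary, uses Scikit-learn); the function names, hyperparameters, and toy feature vectors are assumptions chosen for illustration, and the toy values are not measurements from the paper.

```python
import random


def train_linear_svm_sgd(samples, labels, alpha=0.01, lr=0.1, epochs=50, seed=0):
    """Train a linear SVM (hinge loss + L2 penalty) with stochastic subgradient steps.

    samples: list of {feature_name: value} dicts; labels: +1 (malicious) or -1 (benign).
    """
    rng = random.Random(seed)
    weights = {}
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            score = sum(weights.get(f, 0.0) * v for f, v in x.items()) + bias
            # L2 penalty: shrink every weight slightly on each step.
            for f in weights:
                weights[f] *= 1.0 - lr * alpha
            # Hinge loss: update only when the sample is inside the margin.
            if y * score < 1.0:
                for f, v in x.items():
                    weights[f] = weights.get(f, 0.0) + lr * y * v
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 according to the sign of the decision function."""
    score = sum(weights.get(f, 0.0) * v for f, v in x.items()) + bias
    return 1 if score >= 0 else -1


def top_features(weights, k=5):
    """Features with the largest positive weights, i.e. pushing toward +1."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy TF-IDF-style vectors (hypothetical values, for illustration only).
MAL = "NtDelayExecution NtDeviceIoControlFile"
BEN = "NtReadFile NtClose"
samples = [{MAL: 1.0}, {MAL: 0.8}, {BEN: 1.0}, {BEN: 0.9}]
labels = [1, 1, -1, -1]

weights, bias = train_linear_svm_sgd(samples, labels)
ranked = top_features(weights, k=1)
```

Ranking by signed weight surfaces the n-grams that push a trace toward the malicious class; with an L1 penalty instead of L2, many weights would be driven to exactly zero, which is one reason the two regularizers can highlight different sequences.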


This review was generated by AI and reviewed by a human editor.