QUICK REVIEW

[Paper Review] A Survey on Automated Software Vulnerability Detection Using Machine Learning and Deep Learning

Nima Shiri Harzevili, Alvine Boaye Belle|arXiv (Cornell University)|Jun 20, 2023
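Software Engineering Research | Cited by 9
One-Sentence Summary

This paper presents a systematic survey of ML/DL-based software vulnerability detection research from 2011 to 2022, analyzing 67 studies from 37 venues. It covers datasets, representations, models, vulnerability types, and interpretability, and outlines challenges and future directions.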

ABSTRACT

Software vulnerability detection is critical in software security because it identifies potential bugs in software systems, enabling immediate remediation and mitigation measures to be implemented before they may be exploited. Automatic vulnerability identification is important because it can evaluate large codebases more efficiently than manual code auditing. Many Machine Learning (ML) and Deep Learning (DL) based models for detecting vulnerabilities in source code have been presented in recent years. However, a survey that summarises, classifies, and analyses the application of ML/DL models for vulnerability detection is missing. It may be difficult to discover gaps in existing research and potential for future improvement without a comprehensive survey. This could result in essential areas of research being overlooked or under-represented, leading to a skewed understanding of the state of the art in vulnerability detection. This work addresses that gap by presenting a systematic survey to characterize various features of ML/DL-based source code level software vulnerability detection approaches via five primary research questions (RQs). Specifically, our RQ1 examines the trend of publications that leverage ML/DL for vulnerability detection, including the evolution of research and the distribution of publication venues. RQ2 describes vulnerability datasets used by existing ML/DL-based models, including their sources, types, and representations, as well as analyses of the embedding techniques used by these approaches. RQ3 explores the model architectures and design assumptions of ML/DL-based vulnerability detection approaches. RQ4 summarises the type and frequency of vulnerabilities that are covered by existing studies. Lastly, RQ5 presents a list of current challenges to be researched and an outline of a potential research roadmap that highlights crucial opportunities for future work.

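Research Motivation and Objectives

  • Assess the evolution and trends of ML/DL-based vulnerability detection research and its publication venues.
  • Characterize the datasets used for ML/DL vulnerability detection, including their sources, types, representations, and embeddings.
  • Classify the ML/DL model architectures and design choices used for vulnerability detection.
  • Identify the range of vulnerability types covered, and highlight key challenges and future research directions.
  • Provide a replication package to support reproducibility and extension of the results.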

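Proposed Method

  • Conduct a systematic literature review of ML/DL-based vulnerability detection studies from 2011 to 2022.
  • Collect data from ScienceDirect, IEEE Xplore, ACM DL, and Google Scholar using targeted search terms.
  • Apply inclusion criteria focused on ML/DL vulnerability detection at the source code level.
  • Extract and synthesize data on datasets, representations, embeddings, models, vulnerability types, and interpretability.
  • Classify models by architecture and analyze technique selection strategies.
  • Provide replication resources (Colab notebooks) for reproducibility.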

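Experimental Results

Research Questions

  • RQ1: What are the research trends in using ML/DL models for vulnerability detection, including trends over time and the distribution of publication venues?
  • RQ2: What are the characteristics of the experimental datasets used for software vulnerability detection (data sources, types, representations, embeddings)?
  • RQ3: Which ML/DL models and architectures are used for vulnerability detection?
  • RQ4: Which vulnerability types are most frequently covered in these studies?
  • RQ5: What are the challenges and future directions for ML/DL-based software vulnerability detection?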

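Key Findings

  • We analyzed 67 relevant studies on ML/DL-based vulnerability detection published in 37 journals/conferences between 2011 and 2022.
  • We provide a comprehensive analysis of datasets, data processing, representations, embeddings, model architectures, interpretability, and vulnerability types.
  • We classify the ML/DL models used for vulnerability detection by architecture and analyze model selection strategies.
  • We discuss various technical challenges and outline future research directions for ML/DL-based vulnerability detection.
  • We share our results and analysis data as a replication package to facilitate follow-up work.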

This review was generated by AI and reviewed by human editors.