QUICK REVIEW

[Paper Review] A Survey on Automated Software Vulnerability Detection Using Machine Learning and Deep Learning

Nima Shiri Harzevili, Alvine Boaye Belle|arXiv (Cornell University)|2023. 06. 20.

Software Engineering Research | Citations: 9

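One-Line Summary

This paper is a systematic survey of ML/DL-based software vulnerability detection from 2011 to 2022. It analyzes 67 studies from 37 conferences and journals, covers datasets, representations, models, vulnerability types, and interpretability, and presents open challenges and future directions.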

ABSTRACT

Software vulnerability detection is critical in software security because it identifies potential bugs in software systems, enabling immediate remediation and mitigation measures to be implemented before they may be exploited. Automatic vulnerability identification is important because it can evaluate large codebases more efficiently than manual code auditing. Many Machine Learning (ML) and Deep Learning (DL) based models for detecting vulnerabilities in source code have been presented in recent years. However, a survey that summarises, classifies, and analyses the application of ML/DL models for vulnerability detection is missing. It may be difficult to discover gaps in existing research and potential for future improvement without a comprehensive survey. This could result in essential areas of research being overlooked or under-represented, leading to a skewed understanding of the state of the art in vulnerability detection. This work addresses that gap by presenting a systematic survey to characterize various features of ML/DL-based source code level software vulnerability detection approaches via five primary research questions (RQs). Specifically, our RQ1 examines the trend of publications that leverage ML/DL for vulnerability detection, including the evolution of research and the distribution of publication venues. RQ2 describes vulnerability datasets used by existing ML/DL-based models, including their sources, types, and representations, as well as analyses of the embedding techniques used by these approaches. RQ3 explores the model architectures and design assumptions of ML/DL-based vulnerability detection approaches. RQ4 summarises the type and frequency of vulnerabilities that are covered by existing studies. Lastly, RQ5 presents a list of current challenges to be researched and an outline of a potential research roadmap that highlights crucial opportunities for future work.

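Research Motivation and Objectives

Assess the evolution and trends of ML/DL-based vulnerability detection research and its publication venues.
Characterize the datasets used for ML/DL vulnerability detection (sources, types, representations, embeddings).
Classify the ML/DL model architectures and design choices used for vulnerability detection.
Identify the range of vulnerabilities covered and highlight the main challenges and future research directions.
Provide a replication package to support reproducing and extending the survey's findings.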

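Proposed Method

A systematic literature review of ML/DL-based vulnerability detection research from 2011–2022.
Data collection from ScienceDirect, IEEE Xplore, ACM DL, and Google Scholar using targeted search terms.
Inclusion criteria focused on ML/DL-based vulnerability detection in source code.
Extraction and synthesis of data on datasets, representations, embeddings, models, vulnerability types, and interpretability.
Classification of models by architecture and analysis of technique selection strategies.
Provision of replication resources (a Colab notebook) for reproducibility.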
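
The screening step described above can be pictured with a minimal Python sketch. It is not the authors' pipeline: the keyword lists, the `Candidate` record, the title-based de-duplication across databases, and the example records are all hypothetical placeholders chosen for illustration. Only the 2011–2022 window and the focus on ML/DL-based detection in source code come from the review.

```python
from dataclasses import dataclass

# Hypothetical keyword lists; the survey's actual search strings are not given here.
ML_TERMS = ("machine learning", "deep learning", "neural network")
VULN_TERMS = ("vulnerability detection", "vulnerability prediction")


@dataclass(frozen=True)
class Candidate:
    """One search hit from a bibliographic database (hypothetical record shape)."""
    title: str
    year: int
    abstract: str


def meets_inclusion_criteria(paper: Candidate, start: int = 2011, end: int = 2022) -> bool:
    """True if the paper is inside the review window and is about
    ML/DL-based vulnerability detection in source code (keyword heuristic)."""
    text = f"{paper.title} {paper.abstract}".lower()
    in_window = start <= paper.year <= end
    mentions_ml = any(term in text for term in ML_TERMS)
    mentions_vuln = any(term in text for term in VULN_TERMS)
    mentions_source = "source code" in text
    return in_window and mentions_ml and mentions_vuln and mentions_source


def screen(candidates):
    """Drop duplicate titles (case-insensitive), then apply the inclusion criteria."""
    seen = set()
    kept = []
    for paper in candidates:
        key = paper.title.strip().lower()
        if key in seen:
            continue  # same paper returned by more than one database
        seen.add(key)
        if meets_inclusion_criteria(paper):
            kept.append(paper)
    return kept


# Made-up placeholder records, not studies from the survey.
EXAMPLES = [
    Candidate("Placeholder A", 2019, "Deep learning for vulnerability detection in source code."),
    Candidate("placeholder a", 2019, "Duplicate hit from a second database."),
    Candidate("Placeholder B", 2009, "Machine learning for vulnerability detection in source code."),
    Candidate("Placeholder C", 2020, "Deep learning for vulnerability detection in binaries."),
]

if __name__ == "__main__":
    for paper in screen(EXAMPLES):
        print(paper.year, paper.title)
```

In practice a keyword filter like this would only produce a shortlist. A systematic review typically follows it with manual reading of each candidate, which this sketch does not attempt to model.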

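Experimental Results

Research Questions

RQ1: What are the trends in vulnerability detection research using ML/DL models, including temporal trends and the distribution of publication venues?
RQ2: What are the characteristics of the experimental datasets used for software vulnerability detection (data sources, types, representations, embeddings)?
RQ3: Which ML/DL models and architectures are used for vulnerability detection?
RQ4: Which vulnerability types are most frequently covered in these studies?
RQ5: What are the challenges and future directions for software vulnerability detection using ML/DL?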

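Key Results

The survey analyzes 67 relevant studies on ML/DL-based vulnerability detection, drawn from 37 journals and conferences between 2011 and 2022.
It provides a comprehensive analysis of datasets, data processing, representations, embeddings, model architectures, interpretability, and vulnerability types.
It classifies the ML/DL models used for vulnerability detection by architecture and analyzes model selection strategies.
It discusses the distinct technical challenges of ML/DL-based vulnerability detection and proposes future research directions.
It shares the results and analysis data as a replication package to facilitate follow-up research.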


This review was generated by AI and reviewed by a human editor.