QUICK REVIEW

[Paper Review] Malicious URL Detection using Machine Learning: A Survey

Doyen Sahoo, Chenghao Liu|arXiv (Cornell University)|2017. 01. 25.

Spam and Phishing Detection|References: 193|Citations: 275

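One-Line Summary

A comprehensive survey of machine learning approaches to malicious URL detection, covering feature representation, learning algorithms, and system design considerations that go beyond traditional blacklists.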

ABSTRACT

Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as a machine learning task, and categorize and review the contributions of literature studies that addresses different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences, not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design, open research challenges, and point out some important directions for future research.

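Research Motivation and Goals

Formalize malicious URL detection as a machine learning task (binary classification).
Categorize and review the literature by feature representation and learning algorithm.
Discuss practical system design, open research challenges, and future directions.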
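
The binary-classification framing in the first goal can be written out as a short sketch. The notation here is an assumption chosen for illustration (including the convention that +1 means malicious), not a quotation from the paper:

```latex
\begin{aligned}
&\text{Data: } \{(u_1, y_1), \dots, (u_T, y_T)\}, \quad y_t \in \{+1, -1\} \\
&\text{Feature map: } g : u_t \mapsto \mathbf{x}_t \in \mathbb{R}^d \\
&\text{Prediction: } \hat{y}_t = \operatorname{sign}\big(f(\mathbf{x}_t)\big), \quad f : \mathbb{R}^d \to \mathbb{R} \\
&\text{Objective: } \min_f \sum_{t=1}^{T} \mathbb{I}\big[\hat{y}_t \neq y_t\big]
\end{aligned}
```

Here u_t is a raw URL, g is the feature representation step, and f is the learned scoring function. The mistake count in the last line is hard to optimize directly, so learners typically minimize a surrogate such as the hinge or logistic loss.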

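Proposed Method

Formulate the problem as binary classification over features extracted from URLs.
Review static analysis features (extracted without executing the URL) and their effect on ML performance.
Categorize feature types: blacklist, lexical, host-based, content-based, and others.
Discuss learning algorithms, including online learning, for handling scale and sparsity.
Consider malicious URL detection as a service and practical deployment issues.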
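
To make the lexical feature category concrete, the following is a minimal sketch of static feature extraction from the URL string alone. The function name `lexical_features` and the particular features chosen (lengths, dot and digit counts, an IPv4-host flag, bag-of-words tokens) are illustrative assumptions, not the feature set of any specific system covered by the survey.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Extract a few illustrative lexical features from a URL string.

    Hypothetical helper: the feature names and choices are assumptions
    made for this sketch. Nothing is fetched or executed (static analysis).
    """
    # urlsplit only recognizes the host if a scheme separator is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    features = {
        "url_length": float(len(url)),
        "host_length": float(len(host)),
        "num_dots_in_host": float(host.count(".")),
        "num_digits": float(sum(c.isdigit() for c in url)),
        # Raw IPv4 addresses in place of a domain name are a common signal.
        "host_is_ipv4": float(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))),
        "has_at_symbol": float("@" in url),
    }

    # Bag-of-words: split the URL on non-alphanumeric delimiters.
    for token in re.split(r"[^A-Za-z0-9]+", url.lower()):
        if token:
            features["tok=" + token] = 1.0
    return features
```

The result is a sparse dictionary, which suits the bag-of-words part: most tokens never appear in a given URL, so only the present ones are stored. Host-based and content-based features would need external lookups or page downloads and are left out of this sketch.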

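Experimental Results

Research Questions

RQ1: Which feature representations are effective for distinguishing malicious from benign URLs under static analysis?
RQ2: Which machine learning algorithms and learning strategies work best on large-scale, sparse URL data?
RQ3: What are the practical challenges and design considerations in deploying a malicious URL detection system?
RQ4: How do the different feature categories (lexical, host-based, content-based, etc.) contribute to detection performance?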

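Key Findings

Machine learning approaches generalize to new URLs that blacklists miss.
Static analysis features are central, with lexical, host-based, and content-based features driving performance.
Online learning and sparsity-aware methods address the scalability problems of large-scale URL datasets.
The survey provides a systematic framework and categorization of feature representations and algorithms for malicious URL detection.
It also covers practical system design, open challenges, and future research directions.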
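
The point about online learning on sparse data can be illustrated with a small sketch. It combines the hashing trick with online logistic regression trained by stochastic gradient descent, which is one generic technique of this kind and not a specific algorithm taken from the survey. The names `hash_features`, `OnlineLogisticRegression`, and `DIM`, as well as the learning rate, are assumptions made for illustration.

```python
import math
import zlib

DIM = 2 ** 18  # assumed size of the hashed feature space


def hash_features(tokens):
    """Map string tokens to a sparse {index: value} vector (hashing trick)."""
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % DIM
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, lr=0.5):
        self.w = {}   # sparse weights: only indices seen so far are stored
        self.lr = lr  # assumed learning rate

    def score(self, x):
        return sum(self.w.get(i, 0.0) * v for i, v in x.items())

    def predict_proba(self, x):
        # Clamp the score to keep math.exp within a safe range.
        z = max(-30.0, min(30.0, self.score(x)))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on a single example; y is 1 (malicious) or 0 (benign)."""
        err = self.predict_proba(x) - y
        for i, v in x.items():
            self.w[i] = self.w.get(i, 0.0) - self.lr * err * v
```

Each update touches only the indices present in the current example, so the cost per URL depends on the number of tokens in that URL and not on `DIM`. In use, each incoming labeled URL would be tokenized, passed through `hash_features`, and fed to `update` once, without storing the full dataset in memory.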

This review was generated by AI and reviewed by a human editor.