QUICK REVIEW

[Paper Review] Using Lexical Features for Malicious URL Detection -- A Machine Learning Approach

Apoorva Joshi, Levi Lloyd | arXiv (Cornell University) | October 14, 2019

Spam and Phishing Detection | Citations: 24

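One-Line Summary

This paper proposes a machine learning ensemble model that detects malicious URLs with high sensitivity, using static lexical features extracted directly from the URL string. Across five test sets the method reports an average false negative rate of 0.1%, an average accuracy of 92%, and an average AUC of 0.98. Because it needs neither crawling nor blacklist lookups, it adds minimal latency to real-time email security workflows.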

ABSTRACT

Malicious websites are responsible for a majority of the cyber-attacks and scams today. Malicious URLs are delivered to unsuspecting users via email, text messages, pop-ups or advertisements. Clicking on or crawling such URLs can result in compromised email accounts, launching of phishing campaigns, download of malware, spyware and ransomware, as well as severe monetary losses. A machine learning based ensemble classification approach is proposed to detect malicious URLs in emails, which can be extended to other methods of delivery of malicious URLs. The approach uses static lexical features extracted from the URL string, with the assumption that these features are notably different for malicious and benign URLs. The use of such static features is safer and faster since it does not involve crawling the URLs or blacklist lookups which tend to introduce a significant amount of latency in producing verdicts. The goal of the classification was to achieve high sensitivity i.e. detect as many malicious URLs as possible. URL strings tend to be very unstructured and noisy. Hence, bagging algorithms were found to be a good fit for the task since they average out multiple learners trained on different parts of the training data, thus reducing variance. The classification model was tested on five different testing sets and produced an average False Negative Rate (FNR) of 0.1%, average accuracy of 92% and average AUC of 0.98. The model is presently being used in the FireEye Advanced URL Detection Engine (used to detect malicious URLs in emails), to generate fast real-time verdicts on URLs. The malicious URL detections from the engine have gone up by 22% since the deployment of the model into the engine workflow. The results obtained show noteworthy evidence that a purely lexical approach can be used to detect malicious URLs.
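The abstract's rationale for bagging (averaging learners trained on different parts of the training data) can be illustrated with a minimal sketch. The base learner here is a one-feature decision stump, and the toy `(url_length, label)` pairs, the function names, and the number of learners are all assumptions made for illustration; the abstract does not state which base learners or features the authors used.

```python
import random
from statistics import mean


def fit_stump(samples):
    """Pick the threshold t that minimizes errors for the rule: malicious if x >= t."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in samples}):
        errors = sum((x >= threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagged_stumps(samples, n_learners=25, seed=0):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_learners):
        bootstrap = [rng.choice(samples) for _ in samples]
        thresholds.append(fit_stump(bootstrap))
    return thresholds


def predict_score(thresholds, x):
    """Average the learners' votes: fraction that label x as malicious."""
    return mean([1.0 if x >= t else 0.0 for t in thresholds])


# Hypothetical toy data: (url_length, label), label 1 = malicious, with some overlap.
toy_samples = [(18, 0), (22, 0), (25, 0), (30, 0), (41, 0),
               (35, 1), (48, 1), (55, 1), (60, 1), (72, 1)]
ensemble = fit_bagged_stumps(toy_samples)
print(predict_score(ensemble, 38))  # a borderline length gets a fractional vote
```

Each stump sees a slightly different resample, so individual thresholds vary; averaging the votes smooths out that variation, which is the variance reduction the abstract refers to.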

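Motivation and Objectives

Address the threat posed by malicious URLs in the growing number of cyber-attacks carried out through phishing and malware distribution.
Reduce the latency of malicious URL detection by avoiding dynamic analysis and blacklist lookups.
Improve detection sensitivity so that as few malicious URLs as possible are missed in email security workflows.
Demonstrate how effectively purely lexical features can separate malicious URLs from benign ones.
Develop a scalable, real-time solution that can be deployed in production security systems (for example, FireEye's URL detection engine).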

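Proposed Method

Extract static lexical features from the URL string, such as length, special characters, digit frequency, and subdomain patterns.
Apply bagging-based ensemble learning (for example, random forest or a similar technique) to reduce variance and improve robustness on noisy, unstructured URLs.
Train the model on diverse URL datasets so that it generalizes across different malicious URL patterns.
Use threshold-based classification tuned for high sensitivity, prioritizing the minimization of false negatives.
Use feature engineering to capture linguistic and syntactic anomalies in URL strings.
Deploy the model in a real-time pipeline for email-based URL analysis, avoiding resource-intensive crawling and external lookups.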
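As a rough illustration of the first step, the sketch below computes a handful of lexical features from the URL string alone, with no network access. The feature names, the `SPECIAL_CHARS` set, and the `lexical_features` helper are assumptions chosen for this example; the paper's actual feature set is not listed in this review.

```python
from urllib.parse import urlsplit

# Hypothetical set of "special" characters; the paper's actual list is not given here.
SPECIAL_CHARS = set("-_@=?&%.~+#!$*,;:")


def lexical_features(url: str) -> dict:
    """Compute a few static lexical features from a URL string (no network access)."""
    # urlsplit only recognizes the host when a "//" authority marker is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    host_is_ipv4 = len(labels) == 4 and all(label.isdigit() for label in labels)
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "digit_count": sum(ch.isdigit() for ch in url),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
        # Naive: treats the last two labels as the registered domain (wrong for e.g. co.uk).
        "subdomain_count": 0 if host_is_ipv4 else max(len(labels) - 2, 0),
        "host_is_ipv4": host_is_ipv4,
    }


features = lexical_features("http://login.secure-update.example.com/verify?id=12345")
print(features)
```

Because every feature is derived from the string itself, extraction never touches the target site, which is what makes this style of analysis safe and fast compared with crawling. A real system would likely use a public-suffix list rather than the two-label shortcut used here.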

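Experimental Results

Research Questions

RQ1: Can purely lexical features of the URL string effectively distinguish malicious URLs from benign ones?
RQ2: How does a bagging-based ensemble model perform on malicious URL detection compared with other machine learning approaches?
RQ3: To what extent can a static lexical method achieve high sensitivity without dynamic crawling or external blacklists?
RQ4: What effect does such a model have on detection performance in a real-world production security system?
RQ5: How well does the model's performance hold up across diverse, realistic URL datasets?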

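Key Results

The model achieved an average false negative rate (FNR) of 0.1%, meaning it missed very few malicious URLs.
Average accuracy across five different test sets was 92%, suggesting good generalization.
The average area under the ROC curve (AUC) was 0.98, indicating strong separation between malicious and benign URLs.
After the model was integrated into the FireEye Advanced URL Detection Engine, malicious URL detections increased by 22%.
The method was both effective and efficient, producing low-latency, real-time verdicts without dynamic analysis or external lookups.
The results provide noteworthy evidence that lexical features alone can serve as a basis for malicious URL detection.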
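For readers unfamiliar with the three reported metrics, the following sketch shows how FNR, accuracy, and AUC can be computed from labels and model scores, and how lowering the decision threshold trades accuracy for sensitivity. The labels, scores, and thresholds are made-up toy values, not data from the paper, and the function names are assumptions.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when a score >= threshold is flagged as malicious (label 1)."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        flagged = score >= threshold
        if label == 1 and flagged:
            tp += 1
        elif label == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def false_negative_rate(labels, scores, threshold):
    """Share of malicious URLs that were missed: FN / (FN + TP)."""
    tp, _, _, fn = confusion_counts(labels, scores, threshold)
    return fn / (fn + tp)


def accuracy(labels, scores, threshold):
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return (tp + tn) / (tp + fp + tn + fn)


def auc(labels, scores):
    """Probability that a random malicious URL outscores a random benign one (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only: 1 = malicious, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.1]

for threshold in (0.5, 0.25):
    print(threshold,
          false_negative_rate(labels, scores, threshold),
          accuracy(labels, scores, threshold))
print(auc(labels, scores))
```

On this toy data, dropping the threshold from 0.5 to 0.25 removes the one false negative but introduces more false positives, which mirrors the review's point that the classifier is tuned to favor sensitivity. AUC is threshold-independent, so it stays the same in both cases.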


This review was generated by AI and reviewed by a human editor.