QUICK REVIEW

[Paper Review] URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection

H Le, Quang Pham | arXiv (Cornell University) | 2018-02-09

Spam and Phishing Detection | References: 34 | Citations: 193

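One-Line Summary

URLNet learns a robust URL representation for malicious URL detection by applying two CNN branches, one over characters and one over words. It outperforms baselines built on traditional lexical features and handles unseen and rare words effectively.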

ABSTRACT

Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding for Malicious URL Detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both characters and words of the URL String to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible by the existing models. We also propose advanced word-embeddings to solve the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of various components of URLNet.

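Research Motivation and Objectives

Motivate robust malicious URL detection that goes beyond blacklists and hand-engineered lexical features.
Propose an end-to-end deep learning model that learns a URL embedding directly from the raw URL string.
Capture semantic and sequential patterns in URLs with character-level and word-level CNNs, and use advanced word embeddings.
Address the rare/unseen word problem and memory constraints on large-scale URL datasets.
Evaluate URLNet against strong lexical-feature baselines and run ablation studies to understand each component's contribution.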

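Proposed Method

URLNet is built from two CNN branches: one over a character-level representation of the URL and one over a word-level representation.
Both branches use convolutional filters of several sizes, h ∈ {3, 4, 5, 6}, with 256 filters per size.
For words, an advanced embedding combines word-level and character-level information to handle rare and unseen words.
Special characters are treated as words to capture additional sequence information.
The model is trained end-to-end with dropout and the Adam optimizer; the branch outputs are concatenated before the final dense layer.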
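
The input side of the two branches can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the delimiter rule (any character that is not an ASCII letter or digit counts as a special character), the `min_count` cutoff, the `<PAD>`/`<UNK>` tokens, and all function names are assumptions made for illustration. It shows how one URL could be turned into word IDs (with special characters kept as words), per-word character IDs (the character-level information an advanced word embedding could draw on for rare or unseen words), and a flat character sequence for the character-level branch.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters/digits, and every other
# character is a special character. The paper's delimiter set may differ.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")
_WORD_RE = re.compile(r"[A-Za-z0-9]+")

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical reserved tokens


def tokenize_url(url, specials_as_words=True):
    """Split a URL into words; optionally keep each special character as a word."""
    tokens = _TOKEN_RE.findall(url)
    if specials_as_words:
        return tokens
    return [t for t in tokens if _WORD_RE.fullmatch(t)]


def build_word_vocab(urls, min_count=2):
    """Map words seen at least min_count times to IDs; rarer words share <UNK>."""
    counts = Counter(tok for url in urls for tok in tokenize_url(url))
    vocab = {PAD: 0, UNK: 1}
    for tok in sorted(counts):
        if counts[tok] >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def build_char_vocab(urls):
    """Map every character seen in the training URLs to an ID."""
    vocab = {PAD: 0, UNK: 1}
    for ch in sorted({ch for url in urls for ch in url}):
        vocab[ch] = len(vocab)
    return vocab


def encode_url(url, word_vocab, char_vocab):
    """Return (word IDs, character IDs of each word, character IDs of the URL)."""
    words = tokenize_url(url)
    word_ids = [word_vocab.get(w, word_vocab[UNK]) for w in words]
    # Even when a word maps to <UNK>, its characters are still available.
    chars_per_word = [
        [char_vocab.get(c, char_vocab[UNK]) for c in w] for w in words
    ]
    char_ids = [char_vocab.get(c, char_vocab[UNK]) for c in url]
    return word_ids, chars_per_word, char_ids
```

With a vocabulary built from a handful of training URLs, a previously unseen word such as `paypa1` gets the shared `<UNK>` word ID but still contributes six character IDs. This is the kind of signal a character-based word embedding can use where a plain word lookup cannot.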

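Experimental Results

Research Questions

RQ1: Can URLNet outperform traditional Bag-of-Words lexical features and hand-engineered lexical baselines on malicious URL detection?
RQ2: How do the character-level, word-level, and full URLNet variants compare, and what does the combined (URLNet Full) model contribute?
RQ3: Do character-based word embeddings and special-character handling generalize to unseen and rare words?
RQ4: How does training-set size affect URLNet's performance, and how do the different feature structures contribute to robustness?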

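Key Results

URLNet variants substantially outperform the baseline lexical-feature models on AUC and on TPR at fixed FPR levels (TPR@FPR).
URLNet Full, which combines the character and word CNNs, delivers the strongest and most consistent performance gains.
The character-level and word-level CNNs have complementary strengths, and the Full model exploits both to improve performance across a range of FPR levels.
Advanced word embeddings that incorporate character information ease memory constraints and improve the handling of unseen words.
Increasing the training data from 1 million to 5 million URLs improves performance on every metric.
The character-level CNN excels at picking up patterns in long sequences, the word-level CNN captures token-level semantics, and their combination outperforms either alone.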
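
The TPR@FPR metric in the results above can be made concrete with a small sketch. This is one common way to compute it, not the paper's evaluation script. It assumes labels of 1 for malicious and 0 for benign, scores where higher means more malicious, and a strict `score > threshold` decision rule. The function name `tpr_at_fpr` is hypothetical.

```python
import math


def tpr_at_fpr(labels, scores, max_fpr):
    """True-positive rate at the loosest threshold whose FPR is <= max_fpr."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if not 0.0 <= max_fpr <= 1.0:
        raise ValueError("max_fpr must be in [0, 1]")

    neg = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one benign and one malicious example")

    # Number of benign URLs we may flag without exceeding max_fpr.
    # The small epsilon guards against float round-off such as 0.29 * 100.
    allowed = math.floor(max_fpr * len(neg) + 1e-9)

    # Flag a URL when score > threshold. Using the (allowed + 1)-th highest
    # benign score as the threshold means at most `allowed` benign URLs exceed it.
    threshold = neg[allowed] if allowed < len(neg) else -math.inf

    detected = sum(1 for s in pos if s > threshold)
    return detected / len(pos)
```

A malicious-URL detector is usually deployed at a very low false-positive rate, so reporting TPR at several small FPR values shows how many attacks are caught under a realistic alert budget. AUC alone averages over operating points that would never be used in practice.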

This review was generated by AI and reviewed by a human editor.