QUICK REVIEW

[Paper Review] URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection

H Le, Quang Pham|arXiv (Cornell University)|Feb 9, 2018

Spam and Phishing Detection|References: 34|Citations: 193

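ONE-SENTENCE SUMMARY

URLNet uses two CNN branches, one over characters and one over words (the latter with advanced word embeddings), to learn a robust URL representation for malicious URL detection. It outperforms traditional lexical-feature baselines and handles unseen and rare words effectively.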

ABSTRACT

Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding for Malicious URL Detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both characters and words of the URL String to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible by the existing models. We also propose advanced word-embeddings to solve the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of various components of URLNet.

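Motivation and Objectives

Motivate robust malicious URL detection that goes beyond blacklists and hand-engineered lexical features.
Propose an end-to-end deep learning model that learns a URL embedding directly from raw URL strings.
Capture semantic and sequential patterns in URLs using character-level and word-level CNNs with advanced word embeddings.
Address rare and unseen words, and the associated memory constraints, in large-scale URL datasets.
Evaluate URLNet against strong lexical-feature baselines and run ablation studies to understand the contribution of each component.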

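Proposed Method

Introduces URLNet, which uses two CNN branches: one over a character-level representation of the URL and one over a word-level representation.
Each branch uses multiple filter sizes (h ∈ {3, 4, 5, 6}), with 256 filters per size.
For words, adopts an advanced embedding that combines word-level and character-level information to handle rare and unseen words.
Treats special characters as words to capture additional sequential information.
Concatenates the branch outputs before the final dense layer and trains end to end with dropout and the Adam optimizer.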
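
The two input views and the filter-size arithmetic described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The delimiter rule (any non-alphanumeric character counts as special), the `min_count` threshold, the `<UNK>` token, and all function names are assumptions made for this example. The pooled feature size further assumes that each filter contributes one value after max-over-time pooling, which the summary does not state.

```python
import re

# Assumption: any character that is not an ASCII letter or digit is "special".
_SPECIAL = re.compile(r"([^A-Za-z0-9])")

FILTER_SIZES = (3, 4, 5, 6)   # filter widths h, as listed in the summary
FILTERS_PER_SIZE = 256        # filters per width, as listed in the summary


def char_tokens(url):
    """Character-level view: every character is one token."""
    return list(url)


def word_tokens(url):
    """Word-level view: split on special characters and keep them as words."""
    # The capture group makes re.split return the delimiters as well.
    return [tok for tok in _SPECIAL.split(url) if tok]


def build_vocab(tokenized_urls, min_count=2):
    """Assign ids to words seen at least min_count times; id 0 is <UNK>."""
    counts = {}
    for tokens in tokenized_urls:
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
    vocab = {"<UNK>": 0}
    for tok in sorted(counts):
        if counts[tok] >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, sending rare or unseen words to <UNK>."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]


def conv_positions(seq_len):
    """Positions covered by a 'valid' 1-D convolution for each filter width."""
    return {h: max(seq_len - h + 1, 0) for h in FILTER_SIZES}


def pooled_feature_size():
    """Per-branch feature count, assuming one pooled value per filter."""
    return len(FILTER_SIZES) * FILTERS_PER_SIZE
```

With this sketch, `word_tokens("http://a.com/login")` keeps `:`, `/`, and `.` as separate tokens, so the order of delimiters remains visible to the word-level branch. A word absent from the vocabulary falls back to `<UNK>`, which is the case a character-based word embedding is meant to improve on.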

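Experimental Results

Research Questions

RQ1: Can URLNet outperform malicious URL detection methods built on traditional Bag-of-Words lexical features and hand-engineered lexical baselines?
RQ2: How do the character-level, word-level, and full URLNet variants compare, and how much does the combined model (URLNet Full) contribute?
RQ3: Do character-based word embeddings and special-character handling enable generalization to unseen words?
RQ4: How does training-set size affect URLNet's performance, and how do the different feature architectures contribute to robustness?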

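Key Findings

URLNet variants substantially outperform the baseline lexical-feature models on both AUC and TPR at fixed FPR levels.
URLNet Full, which combines the character-level and word-level CNNs, delivers the strongest and most consistent gains.
The character-level and word-level CNNs offer complementary strengths; the Full model exploits both to improve detection across a range of FPRs.
Advanced word embeddings that incorporate character information address memory constraints and make it possible to handle unseen words.
Increasing the training data from 1M to 5M URLs improves performance across all metrics.
The character-level CNN is good at recognizing patterns over long sequences, while the word-level CNN captures token-level semantics; combining the two outperforms either one alone.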
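
The TPR@FPR metric cited in the findings can be read as "the best true-positive rate achievable while keeping the false-positive rate at or below a chosen budget". The sketch below shows one way to compute it from scores and labels. The function name, the label convention (1 = malicious, 0 = benign), and the simple threshold sweep are assumptions for illustration, and the paper's own evaluation code may differ.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest TPR over thresholds whose FPR does not exceed max_fpr.

    scores: higher means more likely malicious.
    labels: 1 = malicious, 0 = benign (assumed convention).
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")

    best = 0.0
    # Sweep every distinct score as a candidate decision threshold.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        if fp / neg <= max_fpr:
            best = max(best, tp / pos)
    return best
```

The sweep is quadratic in the number of examples, which is fine for a demonstration. On a dataset of millions of URLs, a single pass over sorted scores would be the more practical choice.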


This review was generated by AI and checked by a human editor.