QUICK REVIEW

[Paper Review] Using Lexical Features for Malicious URL Detection -- A Machine Learning Approach

Apoorva Joshi, Levi Lloyd|arXiv (Cornell University)|Oct 14, 2019

Spam and Phishing Detection | Cited by 24

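One-Line Summary

This paper proposes a machine learning ensemble model that detects malicious URLs with high sensitivity using static lexical features extracted directly from the URL string. The method achieves an average false negative rate of 0.1%, an accuracy of 92%, and an AUC of 0.98, and it improves detection in a real-time email security workflow while adding minimal latency.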

ABSTRACT

Malicious websites are responsible for a majority of the cyber-attacks and scams today. Malicious URLs are delivered to unsuspecting users via email, text messages, pop-ups or advertisements. Clicking on or crawling such URLs can result in compromised email accounts, launching of phishing campaigns, download of malware, spyware and ransomware, as well as severe monetary losses. A machine learning based ensemble classification approach is proposed to detect malicious URLs in emails, which can be extended to other methods of delivery of malicious URLs. The approach uses static lexical features extracted from the URL string, with the assumption that these features are notably different for malicious and benign URLs. The use of such static features is safer and faster since it does not involve crawling the URLs or blacklist lookups which tend to introduce a significant amount of latency in producing verdicts. The goal of the classification was to achieve high sensitivity i.e. detect as many malicious URLs as possible. URL strings tend to be very unstructured and noisy. Hence, bagging algorithms were found to be a good fit for the task since they average out multiple learners trained on different parts of the training data, thus reducing variance. The classification model was tested on five different testing sets and produced an average False Negative Rate (FNR) of 0.1%, average accuracy of 92% and average AUC of 0.98. The model is presently being used in the FireEye Advanced URL Detection Engine (used to detect malicious URLs in emails), to generate fast real-time verdicts on URLs. The malicious URL detections from the engine have gone up by 22% since the deployment of the model into the engine workflow. The results obtained show noteworthy evidence that a purely lexical approach can be used to detect malicious URLs.

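Motivation and Objectives

Address the growing threat of malicious URLs delivered through phishing and malware distribution.
Reduce the latency of malicious URL detection by avoiding dynamic analysis and blacklist lookups.
Improve detection sensitivity so that as few malicious URLs as possible are missed in email security workflows.
Demonstrate whether purely lexical features can distinguish malicious URLs from benign ones.
Develop a scalable, real-time solution that can be deployed in production security systems such as the FireEye URL detection engine.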

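Proposed Method

Extract static lexical features from the URL string, such as length, special characters, digit frequency, and subdomain patterns.
Apply bagging-based ensemble learning (e.g., random forests) to reduce variance and improve robustness on noisy, unstructured URLs.
Train the model on diverse URL datasets so that it generalizes across a range of malicious URL patterns.
Use threshold-based classification that prioritizes minimizing false negatives, in line with the high-sensitivity goal.
Use feature engineering to capture linguistic and grammatical anomalies in URL strings.
Deploy the model in a real-time pipeline that avoids resource-intensive crawling and external lookups.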
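
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature list, the `SPECIAL_CHARS` set, the one-split "stump" learner, the 0.3 voting threshold, and the toy training URLs are all hypothetical choices made for this example. The review does not reproduce the paper's actual feature set or model configuration, and the paper's bagging learner is not a stump ensemble as far as this review states.

```python
import random
from urllib.parse import urlsplit

# Hypothetical set of "special" characters; the paper's exact features are not listed here.
SPECIAL_CHARS = set("-_@?=&%.~+#")


def lexical_features(url):
    """Return static features computed from the URL string only (no crawling, no lookups)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    return [
        len(url),                                # total URL length
        sum(ch.isdigit() for ch in url),         # number of digits
        sum(ch in SPECIAL_CHARS for ch in url),  # number of special characters
        max(len(labels) - 2, 0),                 # naive subdomain count (ignores public suffixes)
    ]


def fit_stump(rows):
    """Fit a one-split learner that predicts malicious when feature j exceeds t."""
    best = None
    for j in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[j]
            errors = sum((xi[j] > t) != bool(yi) for xi, yi in rows)
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]


def fit_bagging(rows, n_learners=25, seed=0):
    """Bagging: train each learner on a bootstrap resample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_learners)]


def malicious_score(ensemble, x):
    """Fraction of learners that vote malicious."""
    return sum(x[j] > t for j, t in ensemble) / len(ensemble)


def is_malicious(ensemble, x, threshold=0.3):
    """A threshold below 0.5 accepts more false positives in exchange for fewer misses."""
    return malicious_score(ensemble, x) >= threshold


# Toy, made-up training URLs (label 1 = malicious, 0 = benign).
train_urls = [
    ("https://example.com/", 0),
    ("https://docs.example.org/guide", 0),
    ("https://www.example.net/about", 0),
    ("http://login.secure.example.com.verify-1234.example/a1b2?id=99887766&t=1", 1),
    ("http://198.51.100.7/update/acc0unt/confirm.php?u=1&p=2&s=3", 1),
    ("http://pay.example.com.session-5521.example/x?token=0099aabb1122", 1),
]
rows = [(lexical_features(url), label) for url, label in train_urls]
ensemble = fit_bagging(rows)
```

Calling `is_malicious(ensemble, lexical_features(url))` then yields a verdict from the string alone. Averaging votes over bootstrap resamples is what reduces variance, and moving `threshold` down is the simplest way to favor sensitivity over precision.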

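Experimental Results

Research Questions

RQ1: Can purely lexical features from the URL string effectively distinguish malicious URLs from benign ones?
RQ2: How well do bagging-based ensemble models perform at malicious URL detection compared with other machine learning approaches?
RQ3: How high a sensitivity can a static, lexical method achieve without any URL crawling or external blacklists?
RQ4: What effect does such a model have on real-world detection performance in a production email security system?
RQ5: How does model performance scale across diverse, real-world URL datasets?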

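Key Findings

The model achieved an average false negative rate (FNR) of 0.1%, meaning it missed almost no malicious URLs in the test sets.
It recorded an average accuracy of 92% across five different test sets, indicating good generalization.
The area under the ROC curve (AUC) averaged 0.98, indicating strong discrimination between malicious and benign URLs.
After integration into the FireEye Advanced URL Detection Engine, malicious URL detections increased by 22%.
The approach is both effective and efficient: it needs no dynamic analysis or external lookups and produces real-time verdicts with low latency.
The results strongly suggest that lexical features alone can provide a reliable basis for accurate malicious URL detection.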
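
For readers less familiar with the three reported metrics, the sketch below shows how FNR, accuracy, and AUC are computed from labels and model scores. The labels and scores are made-up toy values chosen only to make the arithmetic visible, and the function names are hypothetical; none of this reproduces the paper's data or evaluation code.

```python
def confusion_counts(labels, scores, threshold):
    """Count true positives, false negatives, false positives, and true negatives."""
    tp = fn = fp = tn = 0
    for y, s in zip(labels, scores):
        predicted_malicious = s >= threshold
        if y and predicted_malicious:
            tp += 1
        elif y:
            fn += 1
        elif predicted_malicious:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def false_negative_rate(tp, fn):
    """Share of malicious URLs that were missed; sensitivity is 1 minus this value."""
    return fn / (tp + fn)


def accuracy(tp, fn, fp, tn):
    """Share of all URLs that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def auc(labels, scores):
    """Probability that a random malicious URL outscores a random benign one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values: 1 = malicious, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.35, 0.4, 0.2, 0.1, 0.05]

default = confusion_counts(labels, scores, threshold=0.5)    # (3, 1, 0, 4)
sensitive = confusion_counts(labels, scores, threshold=0.3)  # (4, 0, 1, 3)
```

On this toy data, lowering the threshold from 0.5 to 0.3 drops the FNR from 0.25 to 0.0 while accuracy stays at 0.875, because one miss is traded for one false alarm. A very low FNR reported alongside a more moderate accuracy is consistent with that kind of trade-off, though the review does not state which operating threshold the authors used.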

This review was generated by AI and checked by a human editor.