QUICK REVIEW

[Paper Review] Malicious URL Detection using Machine Learning: A Survey

Doyen Sahoo, Chenghao Liu|arXiv (Cornell University)|Jan 25, 2017

Spam and Phishing Detection | References: 193 | Citations: 275

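One-Line Summary

A comprehensive survey of machine learning approaches to malicious URL detection, covering feature representation, learning algorithms, and system design considerations that go beyond traditional blacklists.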

ABSTRACT

Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as a machine learning task, and categorize and review the contributions of literature studies that address different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences, not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in the cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design, open research challenges, and point out some important directions for future research.

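Research Motivation and Objectives

Formalize malicious URL detection as a machine learning task (binary classification).
Categorize and review the literature by feature representation and learning algorithm.
Discuss practical system design, open research challenges, and future directions.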
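
The binary-classification framing in the first objective can be written out as follows. This is a sketch of the standard supervised formulation, not a transcription of the paper's equations: the symbols $u_t$ (a URL), $y_t$ (its label), $g$ (feature extraction), $f$ (the learned scoring function), and $\ell$ (a loss function) are notation assumed here for illustration.

```latex
% Training data: T labeled URLs, +1 = malicious, -1 = benign
\[
\mathcal{D} = \{(u_1, y_1), \dots, (u_T, y_T)\}, \qquad y_t \in \{+1, -1\}
\]
% Step 1: feature representation maps each URL to a d-dimensional vector
\[
g : u_t \mapsto \mathbf{x}_t \in \mathbb{R}^d
\]
% Step 2: a learned scoring function gives the predicted label by its sign
\[
f : \mathbb{R}^d \to \mathbb{R}, \qquad \hat{y}_t = \mathrm{sign}\big(f(\mathbf{x}_t)\big)
\]
% Learning objective: minimize the accumulated loss over the training data
\[
\min_{f} \; \sum_{t=1}^{T} \ell\big(f(\mathbf{x}_t),\, y_t\big)
\]
```

The two steps correspond to the two axes the review uses to organize the literature: the choice of $g$ (feature representation) and the choice of how $f$ is learned (algorithm design).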

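Proposed Approach

Formalizes the problem as binary classification, including feature extraction from URLs.
Reviews features obtained by static analysis (which does not execute the URL's content) and their effect on ML performance.
Categorizes feature types: blacklist, lexical, host-based, content-based, and others.
Discusses learning algorithms, including online learning aimed at scalability and sparsity.
Examines offering Malicious URL Detection as a service and the practical challenges of operating it.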
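
To make the "lexical" feature category above concrete, here is a minimal sketch of extracting features from the URL string alone, without fetching the page. The function name `lexical_features`, the feature names, and the particular statistics chosen are illustrative assumptions, not the feature set of any specific system covered by the survey.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Return a sparse dict of simple lexical features for one URL.

    Hypothetical feature set for illustration only.
    """
    # urlsplit only fills in the host when a scheme is present
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    feats = {
        "len_url": float(len(url)),
        "len_host": float(len(host)),
        "num_dots_host": float(host.count(".")),
        "num_digits": float(sum(ch.isdigit() for ch in url)),
        # Raw IPv4 address used in place of a domain name
        "has_ip_host": 1.0 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) else 0.0,
    }
    # Bag-of-words: split each URL part on non-alphanumerics and tag the
    # token with the part it came from, so "login" in the host and "login"
    # in the path become different features.
    for part_name, text in (("host", host), ("path", parts.path), ("query", parts.query)):
        for token in re.split(r"[^A-Za-z0-9]+", text.lower()):
            if token:
                feats[f"{part_name}_tok={token}"] = 1.0
    return feats


example = lexical_features("http://192.168.0.1/login.php?id=7")
```

Because every distinct token becomes its own feature, the feature space grows with the number of URLs seen while each individual URL activates only a handful of entries. That is the sparsity that the learning-algorithm discussion refers to.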

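Results

Research Questions

RQ1: Which feature representations obtained through static analysis are effective for distinguishing malicious URLs from benign ones?
RQ2: Which machine learning algorithms and learning strategies best handle large-scale, sparse URL data?
RQ3: What are the practical challenges and design considerations when deploying a Malicious URL Detection system?
RQ4: How do the different feature categories (lexical, host-based, content-based, etc.) contribute to detection performance?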

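Key Findings

Machine learning approaches can generalize to new URLs that blacklists do not cover.
Static-analysis features are central, with lexical, host-based, and content-based features driving performance.
Online learning and sparsity-aware methods address scalability on large URL datasets.
The survey provides a systematic framework and taxonomy of feature representations and algorithms for Malicious URL Detection.
The survey discusses practical system design, open challenges, and future research directions.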
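
The finding about online learning and sparsity can be illustrated with a small sketch. The code below uses a plain online perceptron over sparse feature dicts; it is chosen for brevity and is not presented as the method the survey recommends. The class name `SparsePerceptron`, the `tok=...` feature names, and the four-example stream are all made up for illustration.

```python
from collections import defaultdict


class SparsePerceptron:
    """Online perceptron over sparse feature dicts (labels are +1 / -1)."""

    def __init__(self):
        # Weights are stored only for features that took part in an update,
        # so memory tracks the active features, not the full feature space.
        self.w = defaultdict(float)
        self.mistakes = 0

    def score(self, x):
        # .get() avoids creating zero-weight entries for unseen features
        return sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

    def update(self, x, y):
        """Predict first, then learn from the true label of this one example."""
        y_hat = self.predict(x)
        if y_hat != y:
            self.mistakes += 1
            for k, v in x.items():
                self.w[k] += y * v
        return y_hat


# Hypothetical stream: +1 = malicious, -1 = benign
stream = [
    ({"tok=login": 1.0, "tok=verify": 1.0}, +1),
    ({"tok=docs": 1.0, "tok=index": 1.0}, -1),
    ({"tok=verify": 1.0, "tok=account": 1.0}, +1),
    ({"tok=docs": 1.0, "tok=api": 1.0}, -1),
]

model = SparsePerceptron()
for x, y in stream:
    model.update(x, y)  # each example is seen once and then discarded
```

Each URL is processed once, and an update touches only the features present in that URL, so the cost per example depends on how many features are active rather than on the total dimensionality. New tokens that appear later in the stream simply become new keys in the weight dict, with no retraining from scratch.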


This review was generated by AI and checked by a human editor.