QUICK REVIEW

[Paper Review] eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys

Joshua Saxe, Konstantin Berlin|arXiv (Cornell University)|Feb 27, 2017

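Network Security and Intrusion Detection | References: 20 | Citations: 101

One-Sentence Summary

eXpose uses a character-level CNN with learnable embeddings to detect malicious URLs, file paths, and registry keys directly from raw strings, outperforming manual-feature baselines at low false positive rates.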

ABSTRACT

For years security machine learning research has promised to obviate the need for signature based detection by automatically learning to detect indicators of attack. Unfortunately, this vision hasn't come to fruition: in fact, developing and maintaining today's security machine learning systems can require engineering resources that are comparable to that of signature-based detection systems, due in part to the need to develop and continuously tune the "features" these machine learning systems look at as attacks evolve. Deep learning, a subfield of machine learning, promises to change this by operating on raw input signals and automating the process of feature design and extraction. In this paper we propose the eXpose neural network, which uses a deep learning approach we have developed to take generic, raw short character strings as input (a common case for security inputs, which include artifacts like potentially malicious URLs, file paths, named pipes, named mutexes, and registry keys), and learns to simultaneously extract features and classify using character-level embeddings and convolutional neural network. In addition to completely automating the feature design and extraction process, eXpose outperforms manual feature extraction based baselines on all of the intrusion detection problems we tested it on, yielding a 5%-10% detection rate gain at 0.1% false positive rate compared to these baselines.

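Motivation and Objectives

Eliminate manual feature engineering in cybersecurity detection by applying end-to-end deep learning to raw string inputs.
Demonstrate a single model that detects multiple artifact types, including URLs, file paths, and registry keys.
Show that embedding + CNN features outperform traditional n-gram and expert-feature baselines on large-scale security datasets.
Evaluate performance at false positive rates expected in real-world operation to assess practicality.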

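Proposed Method

Embed raw character strings into a dense vector space using learnable embeddings.
Apply multi-kernel 1D convolutions over the embedded sequence to detect local patterns.
Aggregate the convolution activations with sum pooling to produce fixed-length features.
Combine the features in a dense classifier that uses dropout and batch normalization to limit overfitting.
Train end-to-end with Adam, balancing benign and malicious examples in each batch to address class imbalance.
Compare against n-gram and expert-feature baselines using ROC/AUC metrics.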
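
The feature-extraction part of the pipeline above (embedding, multi-kernel 1D convolution, sum pooling) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the input length, vocabulary, embedding width, kernel widths, filter count, ReLU activation, and random weights are all assumptions chosen to keep the example small, and the dense classifier and Adam training are omitted.

```python
import random

# All sizes below are hypothetical; they are not taken from the paper.
MAX_LEN = 16                  # inputs are truncated or zero-padded to this length
VOCAB = 128                   # assumption: ASCII code points, id 0 is padding
EMBED_DIM = 4                 # width of each character embedding
KERNEL_WIDTHS = (2, 3, 4, 5)  # one group of filters per kernel width
FILTERS_PER_WIDTH = 3

def encode(text):
    """Map characters to integer ids; non-ASCII characters share id 1."""
    ids = [ord(c) if ord(c) < VOCAB else 1 for c in text[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))

def init_params(seed=0):
    """Random stand-ins for weights that would normally be learned."""
    rng = random.Random(seed)
    embed = [[rng.gauss(0, 0.1) for _ in range(EMBED_DIM)] for _ in range(VOCAB)]
    kernels = {
        w: [[[rng.gauss(0, 0.1) for _ in range(EMBED_DIM)] for _ in range(w)]
            for _ in range(FILTERS_PER_WIDTH)]
        for w in KERNEL_WIDTHS
    }
    return embed, kernels

def conv_sum_pool(seq, kernel):
    """Slide one kernel over the sequence, apply ReLU, and sum over positions."""
    width = len(kernel)
    total = 0.0
    for start in range(len(seq) - width + 1):
        act = sum(kernel[i][d] * seq[start + i][d]
                  for i in range(width) for d in range(EMBED_DIM))
        total += max(0.0, act)  # ReLU is an assumption of this sketch
    return total

def extract_features(text, embed, kernels):
    """Return one sum-pooled value per filter: a fixed-length feature vector."""
    seq = [embed[i] for i in encode(text)]
    return [conv_sum_pool(seq, k) for w in KERNEL_WIDTHS for k in kernels[w]]

embed, kernels = init_params()
url_features = extract_features("http://a.b/c", embed, kernels)
key_features = extract_features("HKLM\\Software\\Run", embed, kernels)
print(len(url_features), len(key_features))
```

Because each filter contributes exactly one pooled value, the feature vector has `len(KERNEL_WIDTHS) * FILTERS_PER_WIDTH` entries regardless of which artifact type the string came from, which is what lets one downstream classifier serve URLs, file paths, and registry keys alike.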

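Experimental Results

Research Questions

RQ1: Can an end-to-end character-level CNN with embeddings learn useful representations for detecting malicious security artifacts from raw strings?
RQ2: Do embedding + convolutional features outperform manual feature extraction baselines across URLs, file paths, and registry keys?
RQ3: Is the approach robust to the obfuscation and variation common in the short strings that make up security artifacts?

Key Findings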

Task	Model	TPR @ FPR 1e-4	TPR @ FPR 1e-3	TPR @ FPR 1e-2	AUC
URLs	Convnet	0.77	0.84	0.92	0.993
URLs	N-gram	0.76	0.78	0.84	0.985
URLs	Expert	0.74	0.78	0.84	0.985
File Paths	Convnet	0.16	0.43	0.68	0.978
File Paths	N-gram	0.18	0.33	0.65	0.972
Registry Keys	Convnet	0.51	0.62	0.86	0.992
Registry Keys	N-gram	0.11	0.49	0.72	0.988
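
To make the size of the improvement concrete, the short sketch below computes the absolute TPR gain of the Convnet over the best baseline in each task at a false positive rate of 1e-3. The numbers are copied from the table above; the dictionary layout and the function name `gains_over_best_baseline` are illustrative choices only.

```python
# TPR at FPR 1e-3, copied from the results table.
tpr_at_1e3 = {
    "URLs": {"Convnet": 0.84, "N-gram": 0.78, "Expert": 0.78},
    "File Paths": {"Convnet": 0.43, "N-gram": 0.33},
    "Registry Keys": {"Convnet": 0.62, "N-gram": 0.49},
}

def gains_over_best_baseline(table):
    """Absolute TPR difference between the Convnet and the strongest baseline."""
    gains = {}
    for task, row in table.items():
        best_baseline = max(v for model, v in row.items() if model != "Convnet")
        gains[task] = round(row["Convnet"] - best_baseline, 2)
    return gains

gains = gains_over_best_baseline(tpr_at_1e3)
for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} TPR")
```

On these figures the gains are 0.06 for URLs, 0.10 for file paths, and 0.13 for registry keys.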

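eXpose generalizes across three artifact types (URLs, file paths, and registry keys) with a single architecture.
At a false positive rate (FPR) of 1e-3, eXpose achieves a higher true positive rate (TPR) than every baseline reported for each task (n-gram for all three tasks, expert features for URLs).
For URLs, eXpose achieves a TPR of 0.84 at FPR 1e-3 and an AUC of 0.993, ahead of the baselines' 0.78 TPR and 0.985 AUC.
For file paths, eXpose achieves a TPR of 0.43 at FPR 1e-3 and an AUC of 0.978, ahead of the baseline's 0.33 TPR and 0.972 AUC.
For registry keys, eXpose achieves a TPR of 0.62 at FPR 1e-3 and an AUC of 0.992, ahead of the baseline's 0.49 TPR and 0.988 AUC.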
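
The findings above are stated as TPR at a fixed FPR. For readers unfamiliar with that metric, the following is a minimal sketch of one way to compute it from classifier scores. It is not the authors' evaluation code; the function name `tpr_at_fpr` and the toy scores are hypothetical.

```python
def tpr_at_fpr(scores, labels, target_fpr):
    """Best TPR achievable by a score threshold whose FPR is <= target_fpr.

    scores: higher means more likely malicious.
    labels: 1 for malicious, 0 for benign.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Lower the threshold step by step, from the highest score downward.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(pairs):
        # Examples with identical scores fall on the same side of any threshold.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > target_fpr:
            break  # any lower threshold would exceed the FPR budget
        best_tpr = tp / n_pos
        i = j
    return best_tpr

# Toy data: three malicious and three benign examples.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
print(tpr_at_fpr(toy_scores, toy_labels, 0.0))
print(tpr_at_fpr(toy_scores, toy_labels, 0.34))
```

Note that a target such as 1e-3 is only meaningful with at least a thousand benign examples; with fewer, the only thresholds that satisfy the budget are those with zero false positives.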

This review was generated by AI and checked by a human editor.