QUICK REVIEW

[Paper Review] NtMalDetect: A Machine Learning Approach to Malware Detection Using Native API System Calls

Chan Woo Kim|arXiv (Cornell University)|Feb 15, 2018

Topic: Advanced Malware Detection Techniques | References: 4 | Citations: 31

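One-line summary

This paper proposes NtMalDetect, a dynamic malware detection system that treats system call traces as text documents and applies natural language processing techniques (specifically TF-IDF-weighted n-grams and linear SVMs) to classify programs as malicious or benign. Using system call sequences, it achieves 96% accuracy and 95% recall, and the SVM optimized with stochastic gradient descent was found to be the most effective and efficient model.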

ABSTRACT

As computing systems become increasingly advanced and as users increasingly engage themselves in technology, security has never been a greater concern. In malware detection, static analysis, the method of analyzing potentially malicious files, has been the prominent approach. This approach, however, quickly falls short as malicious programs become more advanced and adopt the capabilities of obfuscating its binaries to execute the same malicious functions, making static analysis extremely difficult for newer variants. The approach assessed in this paper is a novel dynamic malware analysis method, which may generalize better than static analysis to newer variants. Inspired by recent successes in Natural Language Processing (NLP), widely used document classification techniques were assessed in detecting malware by doing such analysis on system calls, which contain useful information about the operation of a program as requests that the program makes of the kernel. Features considered are extracted from system call traces of benign and malicious programs, and the task to classify these traces is treated as a binary document classification task of system call traces. The system call traces were processed to remove the parameters to only leave the system call function names. The features were grouped into various n-grams and weighted with Term Frequency-Inverse Document Frequency. This paper shows that Linear Support Vector Machines (SVM) optimized by Stochastic Gradient Descent and the traditional Coordinate Descent on the Wolfe Dual form of the SVM are effective in this approach, achieving a highest of 96% accuracy with 95% recall score. Additional contributions include the identification of significant system call sequences that could be avenues for further research.
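The abstract refers to two ways of training the linear SVM: stochastic gradient descent and coordinate descent on the Wolfe dual. For reference, the standard textbook soft-margin formulation can be written as below. The notation ($w$, $C$, $\alpha_i$, $x_i$, $y_i \in \{-1, +1\}$) is assumed here rather than taken from the paper, and the bias term is omitted for brevity.

```latex
\begin{align}
\text{Primal:}\quad & \min_{w}\ \frac{1}{2}\lVert w \rVert_2^2
  + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\, w^\top x_i\bigr) \\
\text{Dual:}\quad & \max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
  \quad \text{s.t.}\quad 0 \le \alpha_i \le C, \\
& w = \sum_{i=1}^{n} \alpha_i\, y_i\, x_i .
\end{align}
```

Stochastic gradient descent takes subgradient steps on the primal objective one sample at a time, whereas dual coordinate descent updates a single $\alpha_i$ per step while keeping it inside the box constraint.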

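Research motivation and objectives

To overcome the limitations of static malware analysis in detecting obfuscated or zero-day malware variants.
To evaluate how effective document classification techniques from natural language processing are for analyzing system call traces in malware detection.
To identify the system call sequences that are most useful for distinguishing malicious from benign behavior.
To build a practical, open-source system (NtMalDetect) that integrates trained classifiers for real-world use.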

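Proposed method

System call traces are extracted from benign and malicious programs, and the parameters are removed so that only the function names remain.
The features are n-grams of system call sequences, from 1-grams to 10-grams, weighted with TF-IDF to emphasize rare but discriminative sequences.
Binary classification is performed with several machine learning models: linear SVMs (optimized with stochastic gradient descent and with coordinate descent), k-NN, and Naive Bayes.
L1 and L2 regularization are applied to the SVM classifiers to identify important system call patterns.
The final NtMalDetect system improves detection robustness by strengthening an ensemble of classifiers.
The system was implemented with Scikit-learn and released as open source on GitHub.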
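The steps above can be sketched with Scikit-learn as follows. This is a minimal illustration, not the NtMalDetect code: the raw trace format (`Name(args)` per line), the helper `strip_parameters`, the toy traces, and their labels are all assumptions made for the example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def strip_parameters(raw_trace: str) -> str:
    """Keep only the leading function name of each trace line (hypothetical format)."""
    names = []
    for line in raw_trace.splitlines():
        match = re.match(r"\s*(\w+)", line)
        if match:
            names.append(match.group(1))
    return " ".join(names)


# Synthetic toy traces; labels are illustrative only (0 = benign, 1 = malicious).
raw_traces = [
    "NtOpenFile(0x1, 0x2)\nNtReadFile(0x1)\nNtClose(0x1)",
    "NtCreateFile(0x3)\nNtWriteFile(0x3)\nNtClose(0x3)",
    "NtDelayExecution(0, 100)\nNtDeviceIoControlFile(0x4)\nNtDelayExecution(0, 100)",
    "NtMapViewOfSection(0x5)\nNtMapViewOfSection(0x5)\nNtQueryInformationThread(0x6)",
]
labels = [0, 0, 1, 1]

# Each trace becomes one "document" of space-separated system call names.
documents = [strip_parameters(trace) for trace in raw_traces]

model = make_pipeline(
    # Word n-grams from 1 to 10 tokens, weighted by TF-IDF; names keep their case.
    TfidfVectorizer(ngram_range=(1, 10), lowercase=False, token_pattern=r"\S+"),
    # Hinge loss with an L2 penalty gives a linear SVM trained by SGD.
    SGDClassifier(loss="hinge", penalty="l2", random_state=0),
)
model.fit(documents, labels)
predictions = model.predict(documents)
```

On real data the traces would be split into training and test sets before fitting; the toy example predicts on its own training documents only to show the call sequence.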

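Experimental results

Research questions

RQ1: Can system call traces be effectively modeled as text documents for malware classification using natural language processing techniques?
RQ2: Which machine learning algorithms are most effective at classifying system call traces as benign or malicious?
RQ3: Which system call n-gram sequences are most discriminative for identifying malicious behavior?
RQ4: How do different regularization and optimization strategies affect classifier performance and efficiency?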

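Key findings

The linear SVM optimized with stochastic gradient descent performed best, reaching 96% accuracy and 95% recall on the test set.
The L2-penalized SVM with SGD optimization was the fastest in both training and inference, with a test time under 0.001 seconds.
The most informative features identified by the L1-regularized SVM included patterns of repeated calls to NtQueryInformationThread and NtMapViewOfSection, suggesting malware-related behavior patterns.
The L2-regularized SVM emphasized sequences containing NtDelayExecution and NtDeviceIoControlFile, which may be indicators of malicious behavior.
The system generalized well to unseen malware variants and outperformed traditional static analysis at detecting obfuscated or zero-day attacks.
The open-source NtMalDetect framework integrates the trained models into a practical tool, enabling real-time malware detection.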

This review was generated by AI and checked by a human editor.