QUICK REVIEW

[Paper Review] Malware Detection with LSTM using Opcode Language

Renjie Lu | arXiv (Cornell University) | Jun 10, 2019
Advanced Malware Detection Techniques | References: 20 | Citations: 35
Quick summary

This paper treats malware opcode sequences as a language and uses word embeddings and a two-stage LSTM with mean-pooling to detect and classify Windows executables in a static-analysis setting. On a dataset of 969 malware and 123 benign files, it reports an AUC of about 0.99 for detection and about 0.987 for classification.

ABSTRACT

Nowadays, with the booming development of the Internet and the software industry, more and more malware variants are designed to perform various malicious activities. Traditional signature-based detection methods cannot detect variants of malware. In addition, most behavior-based methods require a secure and isolated environment to perform malware detection, which is vulnerable to contamination. In this paper, similar to natural language processing, we propose a novel and efficient approach to perform static malware analysis, which can automatically learn the opcode sequence patterns of malware. We propose modeling malware as a language and assess the feasibility of this approach. First, we use the disassembly tool IDA Pro to obtain the opcode sequences of malware. Then the word embedding technique is used to learn the feature vector representation of opcodes. Finally, we propose a two-stage LSTM model for malware detection, which uses two LSTM layers and one mean-pooling layer to obtain the feature representations of the opcode sequences of malware. We perform experiments on a dataset that includes 969 malware and 123 benign files. In terms of malware detection and malware classification, the evaluation results show that our proposed method can achieve an average AUC of 0.99 and an average AUC of 0.987 in the best case, respectively.

Motivation and objectives

  • Frame static malware analysis as a language modeling problem by treating opcodes as words.
  • Automatically learn opcode representations via word embedding to reduce manual feature engineering.
  • Develop a two-stage LSTM architecture with mean-pooling to capture opcode sequence patterns for detection and classification.
  • Evaluate the approach on a mixed malware/benign dataset and compare with other neural models.

Proposed method

  • Disassemble executables with IDA Pro to extract opcode sequences from the .text segment.
  • Build a 391-opcode vocabulary and use CBOW word embeddings of dimension 100 to represent opcodes.
  • Apply a two-stage LSTM where the first stage encodes opcode vectors into function-level representations and the second stage encodes function representations into an overall sample representation.
  • Add a mean-pooling layer after the second LSTM to increase feature representation invariance.
  • Use a softmax classifier on the final representation for malware detection (binary) and malware classification (multi-class).
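
To make the data flow in the steps above concrete, here is a minimal structural sketch in plain Python. It is not the authors' implementation: the weights are random and untrained, the hidden size of 8 is arbitrary, and using the last hidden state of the first stage as the function-level vector is an assumption, since the review does not say how that vector is formed. `TinyLSTM`, `encode_sample`, and the toy sample are hypothetical names introduced only for illustration. A real model would be built and trained in a deep-learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Minimal LSTM with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def _gate(self, name, xh):
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(self.w[name], self.b[name])]

    def run(self, sequence):
        """Return the hidden state at every time step."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        states = []
        for x in sequence:
            xh = list(x) + h
            i = [sigmoid(v) for v in self._gate("i", xh)]
            f = [sigmoid(v) for v in self._gate("f", xh)]
            o = [sigmoid(v) for v in self._gate("o", xh)]
            g = [math.tanh(v) for v in self._gate("c", xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            states.append(h)
        return states


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def encode_sample(functions, stage1, stage2):
    # Stage 1: opcode vectors of each function -> one function vector.
    # Assumption: the last hidden state serves as the function vector.
    function_vectors = [stage1.run(opcodes)[-1] for opcodes in functions]
    # Stage 2: function vectors -> hidden states, then mean-pooling.
    return mean_pool(stage2.run(function_vectors))


rng = random.Random(0)
EMBED_DIM, HIDDEN = 100, 8  # 100 matches the review; 8 is arbitrary
stage1 = TinyLSTM(EMBED_DIM, HIDDEN, rng)
stage2 = TinyLSTM(HIDDEN, HIDDEN, rng)

# Toy sample: three functions with 4, 2, and 5 opcodes, each opcode
# represented by a random stand-in for its embedding vector.
sample = [[[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(n)]
          for n in (4, 2, 5)]
representation = encode_sample(sample, stage1, stage2)

# Linear layer plus softmax for the binary (malware vs. benign) case.
classifier = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(2)]
probs = softmax([sum(w * v for w, v in zip(row, representation))
                 for row in classifier])
print(probs)
```

Because the weights are untrained, the printed probabilities carry no meaning; the point is only that a variable number of functions, each with a variable number of opcodes, collapses into one fixed-length vector before classification.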

Experimental results

Research questions

  • RQ1: Can opcode sequences learned via CBOW embeddings capture discriminative patterns for malware vs. benign files?
  • RQ2: Does a two-stage LSTM with mean-pooling outperform single-stage LSTM or other neural baselines for malware detection and classification?
  • RQ3: What is the impact of word embedding method (CBOW vs. Skip-gram) and window size on detection accuracy?
  • RQ4: How does the proposed approach compare to CNN, RNN, and MLP baselines in terms of accuracy and training time?
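
As background for the embedding question, the sketch below shows what the window size controls when training examples are built from an opcode sequence. It assumes the window counts opcodes on each side of the target, which is the convention in common word2vec implementations; whether the paper uses the same convention is not stated in this review. The opcode sequence and the function names are hypothetical.

```python
def cbow_examples(opcodes, window):
    """Build (context, target) pairs: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(opcodes):
        left = opcodes[max(0, i - window):i]
        right = opcodes[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(opcodes, window):
    """Skip-gram reverses the direction: the target predicts each context opcode."""
    return [(target, ctx)
            for context, target in cbow_examples(opcodes, window)
            for ctx in context]


# Hypothetical opcode sequence from one disassembled function.
opcodes = ["push", "mov", "sub", "call", "add", "ret"]

for context, target in cbow_examples(opcodes, window=2):
    print(context, "->", target)
```

A larger window gives each target a wider context, so CBOW averages over more neighboring opcodes per prediction, while Skip-gram produces more (target, context) pairs from the same sequence.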

Key findings

  • CBOW with a window size of 10 yields the best performance among tested embedding configurations.
  • The two-stage LSTM consistently improves detection and classification over a single-stage baseline that omits the second LSTM.
  • Two-stage LSTM outperforms RNN and MLP, and is competitive with CNN while requiring less training time.
  • On the dataset of 969 malware and 123 benign files, the method achieved average AUC of 0.99 for malware detection and 0.987 for malware classification.
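
Since the headline numbers are AUC values, a short sketch of how a single AUC can be computed may help in reading them. It uses the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The labels and scores are made up for illustration and are not results from the paper, and the review does not say how the reported averages were taken.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = malware, 0 = benign, scores = predicted malware probability.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.95, 0.80, 0.60, 0.40, 0.50, 0.10]
print(roc_auc(labels, scores))  # 0.875 (7 of 8 pairs ranked correctly)
```

AUC does not depend on a decision threshold, which makes it a common choice for imbalanced data such as the 969 malware versus 123 benign files used here.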

This review was generated by AI and checked by a human editor.