QUICK REVIEW

[Paper Review] Droidetec: Android Malware Detection and Malicious Code Localization through Deep Learning

Zhuo Ma, Haoran Ge|arXiv (Cornell University)|Feb 10, 2020

Advanced Malware Detection Techniques|References 36|Cited by 47

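One-Sentence Summary

Droidetec uses Skip-Gram API embeddings and a Bi-LSTM with attention to detect Android malware from API sequences, and it localizes malicious code by highlighting the high-attention API segments within methods.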

ABSTRACT

Android malware detection is a critical step towards building a security credible system. Especially, manual search for the potential malicious code has plagued program analysts for a long time. In this paper, we propose Droidetec, a deep learning based method for android malware detection and malicious code localization, to model an application program as a natural language sequence. Droidetec adopts a novel feature extraction method to derive behavior sequences from Android applications. Based on that, the bi-directional Long Short Term Memory network is utilized for malware detection. Each unit in the extracted behavior sequence is inventively represented as a vector, which allows Droidetec to automatically analyze the semantics of sequence segments and eventually find out the malicious code. Experiments with 9616 malicious and 11982 benign programs show that Droidetec reaches an accuracy of 97.22% and an F1-score of 98.21%. In all, Droidetec has a hit rate of 91% to properly find out malicious code segments.

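Motivation and Objectives

Advance automated, fine-grained Android malware detection that goes beyond binary classification.
Extract meaningful behavior sequences from DEX bytecode, modeling program execution as a language-like sequence.
Use the attention mechanism to identify and localize malicious code segments within an application.
Achieve obfuscation-robust detection by focusing on API call patterns.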

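Proposed Method

Preprocess each APK by extracting API call sequences from its DEX files via invoke instructions.
Build the behavior sequence by depth-first traversal of calls, starting from the root methods.
Represent each API as a distributed embedding learned with Skip-Gram, yielding API vectors.
Train a Bi-LSTM with an attention layer on the API vectors to classify applications as malicious or benign.
Use the attention weights to compute the sequence representation and to identify the top-k suspicious APIs and methods for localization.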
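
The traversal and localization steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the call graph is a hypothetical dictionary from a method name to its ordered callees, the API set and attention scores are made up for the example, and the real system would obtain the scores from the trained Bi-LSTM attention layer. How the paper handles repeated or recursive calls is not stated in this summary, so the sketch only guards against cycles.

```python
import math


def build_behavior_sequence(call_graph, root, api_names):
    """Depth-first walk from `root`, emitting API calls in invocation order.

    call_graph: dict mapping a method name to its ordered list of callees
                (hypothetical structure, assumed for this sketch).
    api_names:  set of callees treated as framework APIs (leaf nodes).
    Returns a list of (api, enclosing_method) pairs.
    """
    sequence = []
    on_stack = set()  # methods currently being expanded, to stop recursion cycles

    def visit(method):
        if method in on_stack:
            return
        on_stack.add(method)
        for callee in call_graph.get(method, []):
            if callee in api_names:
                sequence.append((callee, method))
            else:
                visit(callee)  # user-defined method: expand it in place
        on_stack.discard(method)

    visit(root)
    return sequence


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_suspicious(sequence, scores, k):
    """Return the k (api, method, weight) entries with the highest attention weight."""
    weights = softmax(scores)
    ranked = sorted(zip(weights, sequence), key=lambda pair: pair[0], reverse=True)
    return [(api, method, weight) for weight, (api, method) in ranked[:k]]


# Hypothetical toy app: names and scores are illustrative only.
call_graph = {
    "onCreate": ["getDeviceId", "sendData", "setContentView"],
    "sendData": ["openConnection", "sendTextMessage"],
}
api_names = {"getDeviceId", "setContentView", "openConnection", "sendTextMessage"}

sequence = build_behavior_sequence(call_graph, "onCreate", api_names)
scores = [2.0, 0.5, 3.0, -1.0]  # one assumed attention score per sequence element
suspects = top_k_suspicious(sequence, scores, k=2)
```

Keeping the enclosing method next to each API call is what lets a high attention weight be traced back to a method, which is the granularity at which the summary describes localization.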

Experimental Results

Research Questions

RQ1: Can API invocation sequences learned through embedding and Bi-LSTM with attention accurately detect Android malware?
RQ2: Can the attention mechanism localize the likely malicious code segments within an app’s APIs and methods?
RQ3: How does Droidetec perform across different malware families and against established scanners?
RQ4: What is the impact of API vector dimensionality on detection accuracy and efficiency?

Key Findings

Method	Accuracy	F1 score	False positive rate
Droidetec	97.22%	98.21%	2.11%
Droid-Sec [22]	96.50%	95.34%	3.69%
Zhao et al. [23]	91.92%	90.50%	9.48%
API usage	83.25%	81.56%	16.71%
Permissions	73.11%	70.71%	26.72%
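
For readers who want to relate the three columns to each other, the following sketch shows the standard definitions of accuracy, F1 score, and false positive rate in terms of confusion-matrix counts. It assumes malware is the positive class, which is the usual convention but is not stated in this summary, and the counts in the usage lines are illustrative only, not data from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, F1 score, and false positive rate from confusion counts.

    tp: malware correctly flagged      fp: benign apps wrongly flagged
    tn: benign apps correctly passed   fn: malware missed
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }


# Illustrative counts only (not taken from the paper's experiments).
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Because F1 ignores true negatives while accuracy counts them, the two numbers can diverge when the classes are imbalanced, which is worth keeping in mind when comparing rows.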

Droidetec achieves 97.22% accuracy and 98.21% F1-score with 2.11% FPR on the evaluated dataset.
The method reaches around 91% hit rate for locating malicious code segments within apps.
Droidetec outperforms Droid-Sec and Zhao et al. on detection metrics (Droidetec: 97.22% accuracy, 98.21% F1; Droid-Sec: 96.50% accuracy, 95.34% F1; Zhao: 91.92% accuracy, 90.50% F1).
API vector dimensionality matters: higher dimensions improve accuracy and recall but increase verification time, with 200-dimensional vectors offering a favorable balance (97.2% accuracy, 98.2% F1).
Attention distributions differ between benign apps and malware, and across malware families, enabling semantic differentiation of malicious behavior.


This review was generated by AI and reviewed by human editors.