QUICK REVIEW

[Paper Review] Augmentation Scheme for Dealing with Imbalanced Network Traffic Classification Using Deep Learning

Ramin Hasibi, Matin Shokri|arXiv (Cornell University)|Jan 1, 2019
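Internet Traffic Analysis and Secure E-voting|References: 23|Cited by: 33
One-Sentence Summary

The paper proposes an LSTM-based data augmentation scheme, combined with KDE-based feature replication, to balance imbalanced network traffic datasets and improve the performance of CRNN-based traffic classification.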

ABSTRACT

One of the most important tasks in network management is identifying different types of traffic flows. As a result, a type of management service, called Network Traffic Classifier (NTC), has been introduced. One type of NTCs that has gained huge attention in recent years applies deep learning on packets in order to classify flows. Internet is an imbalanced environment i.e., some classes of applications are a lot more populated than others e.g., HTTP. Additionally, one of the challenges in deep learning methods is that they do not perform well in imbalanced environments in terms of evaluation metrics such as precision, recall, and $\mathrm{F_1}$ measure. In order to solve this problem, we recommend the use of augmentation methods to balance the dataset. In this paper, we propose a novel data augmentation approach based on the use of Long Short Term Memory (LSTM) networks for generating traffic flow patterns and Kernel Density Estimation (KDE) for replicating the numerical features of each class. First, we use the LSTM network in order to learn and generate the sequence of packets in a flow for classes with less population. Then, we complete the features of the sequence with generating random values based on the distribution of a certain feature, which will be estimated using KDE. Finally, we compare the training of a Convolutional Recurrent Neural Network (CRNN) in large-scale imbalanced, sampled, and augmented datasets. The contribution of our augmentation scheme is then evaluated on all of the datasets through measurements of precision, recall, and F1 measure for every class of application. The results demonstrate that our scheme is well suited for network traffic flow datasets and improves the performance of deep learning algorithms when it comes to above-mentioned metrics.
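The KDE step described in the abstract (estimate a feature's distribution, then draw random values from it) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a one-dimensional Gaussian kernel and a normal-reference rule-of-thumb bandwidth, neither of which is specified here, and the function names and packet-length values are hypothetical.

```python
import random
import statistics


def normal_reference_bandwidth(values):
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5) for a 1-D Gaussian KDE.

    Assumption: the source does not say how the bandwidth is chosen.
    Requires at least two values (sample standard deviation).
    """
    n = len(values)
    sigma = statistics.stdev(values)
    return 1.06 * sigma * n ** (-1 / 5)


def sample_from_kde(values, k, rng, bandwidth=None):
    """Draw k samples from a Gaussian KDE fitted to `values`.

    Sampling from a Gaussian KDE is equivalent to picking one observed
    value uniformly at random and adding Gaussian noise with standard
    deviation equal to the bandwidth.
    """
    h = normal_reference_bandwidth(values) if bandwidth is None else bandwidth
    return [rng.choice(values) + rng.gauss(0.0, h) for _ in range(k)]


# Hypothetical packet lengths (bytes) observed for one minority class.
observed_lengths = [40.0, 52.0, 60.0, 64.0, 1460.0, 1500.0]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
synthetic_lengths = sample_from_kde(observed_lengths, 5, rng)
```

In practice the sampled values would likely need post-processing (for example clipping to valid packet sizes), which this sketch leaves out.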

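Motivation and Objectives

  • Address the imbalanced class distribution found in real-world network traffic datasets.
  • Develop an augmentation scheme that expands minority classes while preserving class semantics.
  • Evaluate whether augmented data improves the performance of deep learning classifiers on traffic classification tasks.
  • Compare augmentation against simple oversampling on large-scale traffic data.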
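For context on the comparison in the last objective, one common form of simple oversampling is random duplication of minority-class samples until every class matches the largest one. The sketch below shows that variant only as an assumption; this review does not state which sampling procedure the paper uses, and the function name and toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, rng):
    """Duplicate randomly chosen samples of each smaller class until all
    classes have as many samples as the largest class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the existing ones.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: four "http" flows and one "dns" flow (hypothetical labels).
flows = ["f1", "f2", "f3", "f4", "f5"]
classes = ["http", "http", "http", "http", "dns"]
balanced_flows, balanced_classes = random_oversample(flows, classes, random.Random(0))
```

Because duplication adds no new patterns, it is a natural baseline against which a generative augmentation scheme can be judged.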

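Proposed Method

  • Use an LSTM network to learn and generate sequences of packet directions and TCP window sizes for minority classes.
  • For numerical features, estimate each feature's distribution with Kernel Density Estimation (KDE) and sample from the resulting probability density functions to generate new flows.
  • Combine the generated sequences and KDE-based features into augmented flow samples (at most 20 packets per flow, with zero padding).
  • Train a Convolutional Recurrent Neural Network (CRNN) on the augmented data; the architecture consists of two convolutional layers, an LSTM, and fully connected layers with dropout, ending in a softmax over 19 classes.
  • Evaluate the augmentation by comparing precision, recall, and F1 across the actual, sampled, and augmented datasets, against the baseline and oversampling.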
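The step that combines a generated packet sequence with sampled numerical features into a fixed-size, zero-padded flow can be sketched as follows. The limit of 20 packets comes from the description above; everything else is an assumption made for illustration, including the row layout, the +1/-1 direction encoding, and the function name `assemble_flow`.

```python
MAX_PACKETS = 20  # from the description above: at most 20 packets per flow


def assemble_flow(sequence, numeric_features, n_numeric, max_packets=MAX_PACKETS):
    """Build a fixed-size flow matrix from generated packets.

    sequence:         list of (direction, tcp_window) pairs, e.g. produced by
                      a sequence generator. Direction is assumed to be +1 or
                      -1 so that an all-zero row unambiguously means padding.
    numeric_features: one list of n_numeric floats per packet, e.g. values
                      sampled from per-feature density estimates.
    Returns a list of max_packets rows, each of width 2 + n_numeric.
    """
    if len(sequence) != len(numeric_features):
        raise ValueError("one numeric feature vector is needed per packet")

    rows = []
    for (direction, window), feats in zip(sequence, numeric_features):
        if len(feats) != n_numeric:
            raise ValueError("unexpected number of numeric features")
        rows.append([float(direction), float(window)] + [float(f) for f in feats])

    rows = rows[:max_packets]  # truncate flows that are too long
    padding = [[0.0] * (2 + n_numeric) for _ in range(max_packets - len(rows))]
    return rows + padding  # zero-pad flows that are too short


# Hypothetical three-packet flow with a single numeric feature per packet.
flow = assemble_flow(
    [(1, 29200), (-1, 28960), (1, 229)],
    [[60.0], [60.0], [52.0]],
    n_numeric=1,
)
```

A fixed shape per flow is what allows the samples to be batched for a convolutional front end such as the CRNN described above.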

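Experimental Results

Research Questions

  • RQ1: Can LSTM-based sequence generation combined with KDE-based feature replication mitigate class imbalance in network traffic datasets?
  • RQ2: Does augmentation improve per-class precision, recall, and F1 compared with oversampling?
  • RQ3: On imbalanced traffic data, how does CRNN performance change when training on augmented versus non-augmented data?
  • RQ4: How does augmentation affect overall accuracy and the confusion between major and minor classes?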

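Key Findings

  • Augmentation improves recall on the augmented classes compared with both the actual and the oversampled datasets.
  • Across all classes, overall F1 with augmentation is better than with simple sampling.
  • The CRNN trained on augmented data achieves higher accuracy and fewer false negatives, visible as a shift of the confusion matrix toward correct predictions.
  • Compared with the actual dataset, the augmentation scheme improves accuracy by 6.56 percentage points.
  • Precision may drop slightly for some high-population classes, but the improved recall on minority classes helps raise the overall metrics.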
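The per-class metrics behind these findings can be computed as in the minimal sketch below. It uses the standard definitions of precision, recall, and F1; the function name, class labels, and toy predictions are hypothetical and are not results from the paper.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cls] + fp[cls]
        actual = tp[cls] + fn[cls]
        precision = tp[cls] / predicted if predicted else 0.0
        recall = tp[cls] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = (precision, recall, f1)
    return metrics


# Toy example with a majority class ("http") and a minority class ("dns").
truth = ["http", "http", "http", "dns", "dns"]
preds = ["http", "http", "dns", "dns", "http"]
scores = per_class_metrics(truth, preds)
```

Reporting the metrics per class, rather than only overall accuracy, is what makes the trade-off in the last finding visible: a small precision loss on a populous class can coexist with a recall gain on a minority class.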


This review was generated by AI and reviewed by human editors.