QUICK REVIEW

[Paper Review] Augmentation Scheme for Dealing with Imbalanced Network Traffic Classification Using Deep Learning

Ramin Hasibi, Matin Shokri|arXiv (Cornell University)|Jan 1, 2019

Internet Traffic Analysis and Secure E-voting | References: 23 | Citations: 33

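One-line summary

This paper combines an LSTM-based data augmentation scheme with KDE-based feature replication to balance imbalanced network traffic datasets and improve the performance of a CRNN-based traffic classifier.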

ABSTRACT

One of the most important tasks in network management is identifying different types of traffic flows. As a result, a type of management service, called Network Traffic Classifier (NTC), has been introduced. One type of NTCs that has gained huge attention in recent years applies deep learning on packets in order to classify flows. Internet is an imbalanced environment i.e., some classes of applications are a lot more populated than others e.g., HTTP. Additionally, one of the challenges in deep learning methods is that they do not perform well in imbalanced environments in terms of evaluation metrics such as precision, recall, and $\mathrm{F_1}$ measure. In order to solve this problem, we recommend the use of augmentation methods to balance the dataset. In this paper, we propose a novel data augmentation approach based on the use of Long Short Term Memory (LSTM) networks for generating traffic flow patterns and Kernel Density Estimation (KDE) for replicating the numerical features of each class. First, we use the LSTM network in order to learn and generate the sequence of packets in a flow for classes with less population. Then, we complete the features of the sequence with generating random values based on the distribution of a certain feature, which will be estimated using KDE. Finally, we compare the training of a Convolutional Recurrent Neural Network (CRNN) in large-scale imbalanced, sampled, and augmented datasets. The contribution of our augmentation scheme is then evaluated on all of the datasets through measurements of precision, recall, and F1 measure for every class of application. The results demonstrate that our scheme is well suited for network traffic flow datasets and improves the performance of deep learning algorithms when it comes to above-mentioned metrics.

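Motivation and objectives

Address the imbalanced class distributions found in real-world network traffic datasets.
Develop an augmentation scheme that expands minority classes while preserving class semantics.
Evaluate whether the augmented data improves the performance of deep learning classifiers on traffic classification tasks.
Compare the scheme against simple oversampling methods on large-scale traffic data.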

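Proposed method

Use an LSTM network to learn and generate the packet direction and TCP window size sequences of minority classes.
Estimate the distribution of each numerical feature with Kernel Density Estimation (KDE), then sample from the resulting PDFs to generate new flows.
Combine the generated sequences with the KDE-based features to build augmented flow samples (at most 20 packets per flow, zero-padded).
Train a Convolutional Recurrent Neural Network (CRNN) on the augmented data. The architecture consists of two convolutional layers with dropout, an LSTM layer, and fully connected layers, ending in a softmax over 19 classes.
Evaluate precision, recall, and F1 on the real, sampled, and augmented datasets, comparing against the baseline data and an oversampling approach.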
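The KDE sampling and zero-padding steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a Gaussian kernel with a normal-reference rule-of-thumb bandwidth, a single made-up numerical feature (packet size), and takes the direction sequence as a given input where the paper would use the LSTM generator. All function names and the example values are hypothetical.

```python
import random
import statistics

MAX_PACKETS = 20  # flow length stated in the review; shorter flows are zero-padded


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth for a 1-D Gaussian KDE (an assumed choice)."""
    n = len(values)
    sigma = statistics.stdev(values)
    return 1.06 * sigma * n ** (-1 / 5)


def sample_from_kde(values, k, rng):
    """Draw k samples from a Gaussian KDE fitted to `values`.

    Sampling from a Gaussian KDE is equivalent to picking an observed
    value uniformly at random and adding N(0, h^2) noise to it.
    """
    h = rule_of_thumb_bandwidth(values)
    return [rng.choice(values) + rng.gauss(0.0, h) for _ in range(k)]


def pad_flow(packets, width, max_packets=MAX_PACKETS):
    """Truncate or zero-pad a flow to exactly `max_packets` rows."""
    rows = [list(p) for p in packets[:max_packets]]
    rows += [[0.0] * width for _ in range(max_packets - len(rows))]
    return rows


def build_augmented_flow(directions, feature_values, rng):
    """Attach KDE-sampled feature values to a generated direction sequence."""
    sizes = sample_from_kde(feature_values, len(directions), rng)
    # Clamp at zero because a Gaussian kernel can produce negative values.
    packets = [[float(d), max(s, 0.0)] for d, s in zip(directions, sizes)]
    return pad_flow(packets, width=2)


rng = random.Random(0)
observed_sizes = [60.0, 64.0, 1500.0, 1460.0, 52.0, 1400.0]  # made-up values
flow = build_augmented_flow([1, -1, -1, 1], observed_sizes, rng)
print(len(flow), flow[:4])
```

Each synthetic flow comes out as a fixed 20 x 2 grid, so real and generated flows can share the same classifier input shape. In practice the KDE would be fitted per class and per feature.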

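Experimental results

Research questions

RQ1: Can LSTM-based sequence generation and KDE-based feature replication mitigate class imbalance in network traffic datasets?
RQ2: Does augmentation improve per-class precision, recall, and F1 compared with oversampling?
RQ3: How does CRNN performance on imbalanced traffic data change between augmented and non-augmented data?
RQ4: How does augmentation affect overall accuracy and the confusion between majority and minority classes in the confusion matrix?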

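Key findings

Augmentation improves recall for the augmented classes compared with both the real and the oversampled data.
Overall F1 is higher with augmentation than with simple sampling across all classes.
The CRNN trained on augmented data is more accurate and produces fewer false negatives, with the confusion matrix shifting toward correct predictions.
Compared with the real data, the augmentation scheme improves accuracy by 6.56 points.
Precision may drop slightly for some high-population classes, but recall for minority classes improves and the overall metrics rise.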
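To make the metrics in these findings concrete, the sketch below derives per-class precision, recall, and F1, plus overall accuracy, from a confusion matrix. The 3-class matrix is invented purely for illustration and does not reproduce any result from the paper. The function names are hypothetical.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) per class.

    confusion[i][j] is the number of flows whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other, predicted c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(confusion):
    """Fraction of all flows that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Made-up matrix: class 0 is the majority class, class 2 the smallest.
toy = [
    [90, 5, 5],
    [10, 30, 0],
    [4, 0, 6],
]
results = per_class_metrics(toy)
overall = accuracy(toy)
for label, (p, r, f) in enumerate(results):
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"accuracy={overall:.3f}")
```

Reporting the metrics per class matters here because accuracy alone is dominated by the majority class. A minority class can have poor recall while overall accuracy still looks high.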

This review was generated by AI and checked by a human editor.