QUICK REVIEW

[Paper Review] Augmentation Scheme for Dealing with Imbalanced Network Traffic Classification Using Deep Learning

Ramin Hasibi, Matin Shokri | arXiv (Cornell University) | 2019. 01. 01.

Internet Traffic Analysis and Secure E-voting | References: 23 | Citations: 33

One-Line Summary

This paper combines LSTM-based data augmentation with KDE-based feature replication to balance imbalanced network traffic datasets and improve the performance of a CRNN-based traffic classifier.

ABSTRACT

One of the most important tasks in network management is identifying different types of traffic flows. As a result, a type of management service, called Network Traffic Classifier (NTC), has been introduced. One type of NTCs that has gained huge attention in recent years applies deep learning on packets in order to classify flows. Internet is an imbalanced environment i.e., some classes of applications are a lot more populated than others e.g., HTTP. Additionally, one of the challenges in deep learning methods is that they do not perform well in imbalanced environments in terms of evaluation metrics such as precision, recall, and $\mathrm{F_1}$ measure. In order to solve this problem, we recommend the use of augmentation methods to balance the dataset. In this paper, we propose a novel data augmentation approach based on the use of Long Short Term Memory (LSTM) networks for generating traffic flow patterns and Kernel Density Estimation (KDE) for replicating the numerical features of each class. First, we use the LSTM network in order to learn and generate the sequence of packets in a flow for classes with less population. Then, we complete the features of the sequence with generating random values based on the distribution of a certain feature, which will be estimated using KDE. Finally, we compare the training of a Convolutional Recurrent Neural Network (CRNN) in large-scale imbalanced, sampled, and augmented datasets. The contribution of our augmentation scheme is then evaluated on all of the datasets through measurements of precision, recall, and F1 measure for every class of application. The results demonstrate that our scheme is well suited for network traffic flow datasets and improves the performance of deep learning algorithms when it comes to above-mentioned metrics.

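Research Motivation and Objectives

Address the imbalanced class distribution found in real-world network traffic datasets.
Develop an augmentation scheme that expands minority classes while preserving class semantics.
Evaluate whether augmented data improves the performance of deep learning classifiers on traffic classification.
Compare the scheme against simple oversampling on large-scale traffic data.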

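Proposed Method

An LSTM network learns and generates sequences of packet directions and TCP window sizes for minority classes.
For numerical features, kernel density estimation (KDE) estimates each feature's distribution, and new flows are generated by sampling from the resulting PDFs.
The generated sequences and KDE-sampled features are combined into augmented flow samples (at most 20 packets per flow, with the remainder zero-padded).
A CRNN consisting of two convolutional layers with dropout, an LSTM, and fully connected layers is trained on the augmented data, with a softmax over 19 classes.
The augmentation is evaluated by comparing precision, recall, and F1 on the real, sampled, and augmented data against the baseline and oversampling approaches.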
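
The KDE sampling and zero-padding steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the +1/-1 direction encoding, the single numeric feature (imagined here as a payload length), and the normal-reference bandwidth rule are all assumptions made for the example. Only the 20-packet cap and the zero padding come from the review.

```python
import random
import statistics

MAX_PACKETS = 20  # per the review: at most 20 packets per flow, rest zero-padded


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth for a 1-D Gaussian KDE.

    Assumption: the review does not say how the bandwidth is chosen.
    """
    n = len(values)
    return 1.06 * statistics.stdev(values) * n ** (-1 / 5)


def sample_from_kde(values, k, rng):
    """Draw k samples from a Gaussian KDE fitted to `values`.

    Sampling from a Gaussian KDE amounts to picking an observed point
    uniformly at random and adding Gaussian noise whose standard
    deviation is the bandwidth.
    """
    h = rule_of_thumb_bandwidth(values)
    return [rng.gauss(rng.choice(values), h) for _ in range(k)]


def build_flow(directions, feature_values, rng):
    """Combine a generated direction sequence with KDE-sampled features.

    `directions` stands in for the LSTM output (hypothetical +1/-1 encoding);
    `feature_values` are observed values of one numeric feature for the class.
    """
    directions = directions[:MAX_PACKETS]
    sampled = sample_from_kde(feature_values, len(directions), rng)
    # Clamp at zero, assuming the feature cannot be negative.
    packets = [[float(d), max(s, 0.0)] for d, s in zip(directions, sampled)]
    padding = [[0.0, 0.0] for _ in range(MAX_PACKETS - len(packets))]
    return packets + padding


rng = random.Random(0)
flow = build_flow([1, -1, 1], [100.0, 120.0, 90.0, 110.0], rng)
print(len(flow), flow[:3])
```

Each augmented flow here is a fixed 20 x 2 list, so every sample has the same shape regardless of how many packets were generated.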

Experimental Results

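Research Questions

RQ1: Can LSTM-based sequence generation and KDE-based feature replication mitigate class imbalance in network traffic datasets?
RQ2: Does augmentation improve per-class precision, recall, and F1 compared with oversampling?
RQ3: How does CRNN performance on imbalanced traffic data change when it is trained on augmented data rather than non-augmented data?
RQ4: How does augmentation affect overall accuracy and the confusion between majority and minority classes?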

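Key Findings

Augmentation improves recall for the augmented classes relative to both the real data and the oversampled data.
Overall F1 is better with augmentation than with simple sampling.
The CRNN trained on augmented data achieves higher accuracy and fewer false negatives, and its confusion matrix shifts toward correct predictions.
The augmentation scheme raises accuracy by 6.56 percentage points relative to the real dataset.
Precision may drop slightly for some majority classes, but recall for minority classes improves, so the overall metrics are higher.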
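
To make the precision/recall trade-off in these findings concrete, here is a small sketch of how per-class precision, recall, and F1 can be computed from true and predicted labels. The function name and the toy labels are illustrative assumptions and are not taken from the paper's data.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    tp = [0] * num_classes  # true positives per class
    fp = [0] * num_classes  # false positives per class
    fn = [0] * num_classes  # false negatives per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[t] += 1  # missed an instance of class t
    metrics = []
    for c in range(num_classes):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Toy example: class 0 is the majority class, class 1 the minority class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
for c, (p, r, f) in enumerate(per_class_metrics(y_true, y_pred, 2)):
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the minority class has perfect precision but only 0.5 recall, since one of its two flows is absorbed by the majority class. Improving recall for such classes is the effect the augmentation is reported to have.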


This review was generated by AI and reviewed by a human editor.