QUICK REVIEW

[Paper Review] Spam Detection Using BERT

Thaer Sahmoud, Mohammad A. Mikki | arXiv (Cornell University) | June 6, 2022

Spam and Phishing Detection | Citations: 24

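One-line summary

This paper builds a spam detector on a pre-trained BERT model and evaluates it on multiple corpora, achieving high accuracy across SMS and email datasets. It shows strong context-aware classification of spam versus legitimate (ham) messages.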

ABSTRACT

Emails and SMSs are the most popular tools in today communications, and as the increase of emails and SMSs users are increase, the number of spams is also increases. Spam is any kind of unwanted, unsolicited digital communication that gets sent out in bulk, spam emails and SMSs are causing major resource wastage by unnecessarily flooding the network links. Although most spam mail originate with advertisers looking to push their products, some are much more malicious in their intent like phishing emails that aims to trick victims into giving up sensitive information like website logins or credit card information this type of cybercrime is known as phishing. To countermeasure spams, many researches and efforts are done to build spam detectors that are able to filter out messages and emails as spam or ham. In this research we build a spam detector using BERT pre-trained model that classifies emails and messages by understanding to their context, and we trained our spam detector model using multiple corpuses like SMS collection corpus, Enron corpus, SpamAssassin corpus, Ling-Spam corpus and SMS spam collection corpus, our spam detector performance was 98.62%, 97.83%, 99.13% and 99.28% respectively. Keywords: Spam Detector, BERT, Machine learning, NLP, Transformer, Enron Corpus, SpamAssassin Corpus, SMS Spam Detection Corpus, Ling-Spam Corpus.

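Research motivation and goals

The growth of email and SMS communication creates a need for effective spam detection.
Propose a BERT-based spam detection method that captures contextual cues in messages.
Evaluate the model on multiple public corpora to demonstrate generalization across domains.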

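Proposed method

Use a pre-trained BERT model as the classifier for spam detection.
Train and evaluate on multiple corpora: the SMS collection corpus, the Enron corpus, the SpamAssassin corpus, the Ling-Spam corpus, and the SMS spam collection corpus.
Report accuracy on each corpus to demonstrate effectiveness.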
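
The per-corpus accuracy reporting described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `accuracy`, `evaluate_per_corpus`, `keyword_baseline`, and `toy_corpora` are hypothetical, the toy messages are invented for the example, and a trivial keyword rule stands in for the fine-tuned BERT classifier, which would be plugged in as the `predict` function.

```python
from typing import Callable, Dict, List, Tuple

# One labeled message: (text, label), where 1 = spam and 0 = ham.
Example = Tuple[str, int]


def accuracy(predict: Callable[[str], int], examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty corpus")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def evaluate_per_corpus(
    predict: Callable[[str], int], corpora: Dict[str, List[Example]]
) -> Dict[str, float]:
    """Score the same classifier separately on each named corpus."""
    return {name: accuracy(predict, examples) for name, examples in corpora.items()}


def keyword_baseline(text: str) -> int:
    """Placeholder classifier (assumption): NOT BERT, just a keyword rule."""
    lowered = text.lower()
    return int(any(word in lowered for word in ("free", "winner", "prize")))


# Invented toy data, for illustration only; not taken from the paper's corpora.
toy_corpora: Dict[str, List[Example]] = {
    "toy_sms": [
        ("WINNER! Claim your free prize now", 1),
        ("Are we still meeting at noon?", 0),
        ("Free entry in a weekly draw", 1),
        ("Call me when you get home", 0),
    ],
    "toy_email": [
        ("Quarterly report attached for review", 0),
        ("You are a winner, click to collect", 1),
        ("Lowest mortgage rates, act today", 1),
        ("Feel free to forward the agenda", 0),
    ],
}

results = evaluate_per_corpus(keyword_baseline, toy_corpora)
for name, acc in results.items():
    print(f"{name}: {acc:.2%}")
```

The keyword rule misses spam that avoids its trigger words and flags the harmless phrase "Feel free", which is the kind of error a context-aware model is meant to reduce. Accuracy alone can also be misleading on imbalanced spam corpora, so precision and recall are often worth reporting alongside it.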

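Experimental results

Research questions

RQ1: Can a BERT-based model achieve high accuracy in classifying spam and legitimate messages across diverse corpora?
RQ2: How does the model's performance on SMS versus email datasets compare with existing baselines?
RQ3: Can the approach generalize across multiple spam datasets without dataset-specific tuning?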

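Key results

The model achieved 98.62% accuracy on one corpus, 97.83% on another, 99.13% on a third, and 99.28% on a fourth.
High performance was demonstrated on SMS and email collections, including the Enron, SpamAssassin, Ling-Spam, and SMS spam datasets.
The paper demonstrates strong context-aware spam detection with BERT, although the abstract reports no baseline benchmarks.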

This review was generated by AI and reviewed by a human editor.