QUICK REVIEW

[Paper Summary] Spam Detection Using BERT

Thaer Sahmoud, Mohammad A. Mikki|arXiv (Cornell University)|Jun 6, 2022
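Spam and Phishing Detection|Cited by 24
One-Sentence Summary

The paper builds a spam detector on a pre-trained BERT model and evaluates it on several corpora, reporting high accuracy on both SMS and email datasets. It shows strong, context-aware classification of spam versus ham.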

ABSTRACT

Emails and SMS messages are the most popular tools in today's communications, and as the number of email and SMS users increases, the volume of spam increases as well. Spam is any kind of unwanted, unsolicited digital communication that is sent out in bulk; spam emails and SMS messages cause major resource wastage by unnecessarily flooding network links. Although most spam originates with advertisers looking to push their products, some is far more malicious in intent, such as emails that aim to trick victims into giving up sensitive information like website logins or credit card information, a type of cybercrime known as phishing. To counter spam, much research and effort has gone into building spam detectors that classify messages and emails as spam or ham. In this research we build a spam detector using the pre-trained BERT model, which classifies emails and messages by understanding their context. We trained the detector on multiple corpora: the SMS collection corpus, the Enron corpus, the SpamAssassin corpus, the Ling-Spam corpus, and the SMS spam collection corpus. The detector's performance was 98.62%, 97.83%, 99.13%, and 99.28%, respectively.

Keywords: Spam Detector, BERT, Machine learning, NLP, Transformer, Enron Corpus, SpamAssassin Corpus, SMS Spam Detection Corpus, Ling-Spam Corpus.

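Research Motivation and Objectives

  • Establish the need for effective spam detection given the growing volume of email and SMS traffic.
  • Propose a BERT-based spam detection method that captures contextual cues in messages.
  • Evaluate the model on several public corpora to demonstrate cross-domain generalization.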

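Proposed Method

  • Use the pre-trained BERT model as the classifier for spam detection.
  • Train and evaluate on multiple corpora: the SMS collection corpus, the Enron corpus, the SpamAssassin corpus, the Ling-Spam corpus, and the SMS spam collection corpus.
  • Report a performance metric (accuracy) for each corpus to demonstrate effectiveness.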
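The summary does not state which checkpoint, library, or hyperparameters the authors used, so the following is only a minimal sketch of what "BERT as a spam classifier" could look like. It assumes the Hugging Face `transformers` library with PyTorch; the `bert-base-uncased` checkpoint, the `LABELS` mapping, the `max_length=128` cutoff, and the helper names are all illustrative assumptions, not the paper's setup.

```python
from typing import List, Sequence

# Hypothetical label mapping; the paper's actual encoding is not given.
LABELS = {0: "ham", 1: "spam"}


def decode(logit_rows: Sequence[Sequence[float]]) -> List[str]:
    """Map each row of class logits to the label with the highest score."""
    return [LABELS[max(range(len(row)), key=row.__getitem__)] for row in logit_rows]


def load_classifier(model_name: str = "bert-base-uncased"):
    """Load pre-trained BERT with a new, untrained 2-way classification head.

    The checkpoint name is an assumption. The heavy dependency is imported
    lazily so that `decode` stays usable without it.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS)
    )
    return tokenizer, model


def classify(texts: List[str], tokenizer, model, max_length: int = 128) -> List[str]:
    """Tokenize a batch of messages and return a predicted label for each."""
    import torch

    batch = tokenizer(
        texts,
        padding=True,          # pad to the longest message in the batch
        truncation=True,       # cut messages longer than max_length tokens
        max_length=max_length,
        return_tensors="pt",
    )
    model.eval()
    with torch.no_grad():
        logits = model(**batch).logits  # shape: (batch_size, 2)
    return decode(logits.tolist())
```

The classification head returned by `load_classifier` starts out randomly initialized, so its predictions are only meaningful after fine-tuning on labeled spam/ham messages from one of the corpora above.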

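Experimental Results

Research Questions

  • RQ1: Can a BERT-based model classify spam and legitimate messages with high accuracy across different corpora?
  • RQ2: How does the model perform on SMS and email datasets compared with existing baselines?
  • RQ3: Does the method generalize to a variety of spam datasets without dataset-specific tuning?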

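Key Findings

  • The model reaches 98.62% accuracy on one corpus, 97.83% on a second, 99.13% on a third, and 99.28% on a fourth.
  • It performs well on both SMS and email collections, including the Enron, SpamAssassin, Ling-Spam, and SMS spam datasets.
  • The results demonstrate strong context-aware spam detection with BERT; the abstract reports no lower-bound baseline.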
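The figures above are accuracies, and an accuracy is easier to interpret next to a simple lower bound. The sketch below shows how accuracy is computed and how a majority-class baseline (always predicting the most frequent label) could be obtained for comparison. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of messages whose predicted label matches the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty evaluation set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true: Sequence[str]) -> float:
    """Accuracy obtained by always predicting the most frequent gold label."""
    if not y_true:
        raise ValueError("cannot score an empty evaluation set")
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# Toy labels for illustration only; these are not data from the paper.
gold = ["ham", "ham", "ham", "spam", "spam"]
pred = ["ham", "ham", "spam", "spam", "spam"]

print(accuracy(gold, pred))      # 4 of 5 correct
print(majority_baseline(gold))   # "ham" covers 3 of 5 messages
```

Spam corpora are often imbalanced, so the majority-class baseline can be high by itself, which is why a reported accuracy is usually read against it.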


This summary was generated by AI and reviewed by a human editor.