QUICK REVIEW

[Paper Review] Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection

Maxime Labonne, Seán Moran|arXiv (Cornell University)|Apr 3, 2023

Spam and Phishing Detection|Cited by 13

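One-Sentence Summary

This paper benchmarks models from three LLM families (BERT-like, Sentence Transformers, and Seq2Seq) against traditional baselines on email spam detection, and introduces Spam-T5, a fine-tuned Flan-T5 model that excels in few-shot scenarios.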

ABSTRACT

This paper investigates the effectiveness of large language models (LLMs) in email spam detection by comparing prominent models from three distinct families: BERT-like, Sentence Transformers, and Seq2Seq. Additionally, we examine well-established machine learning techniques for spam detection, such as Naïve Bayes and LightGBM, as baseline methods. We assess the performance of these models across four public datasets, utilizing different numbers of training samples (full training set and few-shot settings). Our findings reveal that, in the majority of cases, LLMs surpass the performance of the popular baseline techniques, particularly in few-shot scenarios. This adaptability renders LLMs uniquely suited to spam detection tasks, where labeled samples are limited in number and models require frequent updates. Additionally, we introduce Spam-T5, a Flan-T5 model that has been specifically adapted and fine-tuned for the purpose of detecting email spam. Our results demonstrate that Spam-T5 surpasses baseline models and other LLMs in the majority of scenarios, particularly when there are a limited number of training samples available. Our code is publicly available at https://github.com/jpmorganchase/emailspamdetection.

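Motivation and Objectives

Motivate the need for spam detection that remains effective under data scarcity, distribution shift, and adversarial drift in email.
Compare LLMs from several families against traditional baselines on four public datasets.
Develop Spam-T5, a Flan-T5 model fine-tuned specifically for email spam detection.
Evaluate performance in both full-training and few-shot settings to understand data efficiency and generalization.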

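Proposed Method

Compare RoBERTa (BERT-like), SetFit (Sentence Transformers), and Flan-T5 (Seq2Seq) against Naïve Bayes, Logistic Regression, KNN, SVM, XGBoost, and LightGBM.
Tune hyperparameters (batch size, learning rate, number of epochs) for each model; for the tf-idf-based baselines, perform feature selection using stratified 5-fold cross-validation.
Adapt Flan-T5 into Spam-T5 by adding the classification prefix "classify as ham or spam:" to the input, and post-process the generated output into a binary label.
Use four datasets (Ling-Spam, SMS Spam Collection, SpamAssassin Public Corpus, and Enron) and evaluate with F1, precision, and recall.
Run full-training experiments (80% of the data) and few-shot experiments with k ∈ {4, 8, 16, 32, 64, 128, 256, Full} training samples.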
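
The prefix, output post-processing, and few-shot sampling steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation (their code is in the repository linked in the abstract). The function names `build_prompt`, `parse_label`, and `sample_few_shot` are hypothetical, as are the fallback to "ham" for unexpected outputs and the reading of k as the total number of training samples drawn uniformly at random.

```python
import random

# Prefix quoted in the review; the trailing space is an assumption.
PREFIX = "classify as ham or spam: "


def build_prompt(email_text: str) -> str:
    """Prepend the classification prefix to the raw email text."""
    return PREFIX + email_text.strip()


def parse_label(generated: str, default: int = 0) -> int:
    """Map generated text to a binary label (1 = spam, 0 = ham).

    Falling back to `default` for unexpected outputs is an assumption
    made for this sketch, not something the review specifies.
    """
    token = generated.strip().lower()
    if token.startswith("spam"):
        return 1
    if token.startswith("ham"):
        return 0
    return default


def sample_few_shot(texts, labels, k, seed=0):
    """Draw k training examples at random; k=None stands for "Full".

    Assumption: k counts the total number of samples, not samples per class.
    """
    if k is None or k >= len(texts):
        return list(texts), list(labels)
    rng = random.Random(seed)
    idx = rng.sample(range(len(texts)), k)
    return [texts[i] for i in idx], [labels[i] for i in idx]
```

In this sketch, the string returned by `build_prompt` would be fed to the Seq2Seq model, and `parse_label` would be applied to the text the model generates. A fixed `seed` keeps the few-shot subsets reproducible across runs.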

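Experimental Results

Research Questions

RQ1: How do leading LLMs compare with traditional baselines in both full-training and few-shot spam detection settings?
RQ2: Does the fine-tuned Seq2Seq model (Spam-T5) outperform other LLMs in very-few-shot scenarios?
RQ3: What is the trade-off between accuracy and computational efficiency when using LLMs versus traditional models for spam detection?
RQ4: How do datasets with different spam/legitimate email distributions affect model performance?
RQ5: Does Spam-T5 remain robust when labeled data is scarce?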

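Key Findings

In the full-training setting, LLMs generally outperform the baselines on the SMS and Enron datasets.
Spam-T5 achieves the highest F1 across the full-training results (mean 0.9742), followed by RoBERTa and SetFit at 0.9670 each.
In the few-shot setting, Spam-T5 dominates very-few-shot performance (4–16 samples) and remains robust as the number of training samples varies.
Averaged across datasets, SVM is the strongest baseline (mean F1 0.9560) and XGBoost the weakest (0.8842).
Spam-T5 performs best when labeled data is scarce, underscoring its few-shot data efficiency.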
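
For readers who want to reproduce this kind of comparison, the standard definitions of precision, recall, and F1, together with a simple mean over datasets, can be sketched as below. Treating spam as the positive class and averaging per-dataset F1 with equal weight are assumptions of this sketch; the review does not state how the paper aggregates scores. The helper names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class.

    Assumption: spam (label 1) is the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_f1(per_dataset_f1):
    """Unweighted mean of per-dataset F1 scores (dict: dataset name -> F1)."""
    return sum(per_dataset_f1.values()) / len(per_dataset_f1)
```

Calling `precision_recall_f1` once per dataset and passing the resulting F1 values to `mean_f1` gives one cross-dataset number per model, which is the shape of the averages quoted above.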

This review was generated by AI and checked by a human editor.