[論文レビュー] Large Language Models Are Zero-Shot Text Classifiers
本論文は、GPTベースのモデルおよび他のベースラインのゼロショットテキスト分類性能を複数のデータセットで評価し、LLMs が四つのタスクのうち三つでゼロショット分類器として機能し得ることを示している。
Retrained large language models (LLMs) have become extensively used across various sub-disciplines of natural language processing (NLP). In NLP, text classification problems have garnered considerable focus, but still faced with some limitations related to expensive computational cost, time consumption, and robust performance to unseen classes. With the proposal of chain of thought prompting (CoT), LLMs can be implemented using zero-shot learning (ZSL) with the step by step reasoning prompts, instead of conventional question and answer formats. The zero-shot LLMs in the text classification problems can alleviate these limitations by directly utilizing pretrained models to predict both seen and unseen classes. Our research primarily validates the capability of GPT models in text classification. We focus on effectively utilizing prompt strategies to various text classification scenarios. Besides, we compare the performance of zero shot LLMs with other state of the art text classification methods, including traditional machine learning methods, deep learning methods, and ZSL methods. Experimental results demonstrate that the performance of LLMs underscores their effectiveness as zero-shot text classifiers in three of the four datasets analyzed. The proficiency is especially advantageous for small businesses or teams that may not have extensive knowledge in text classification.
研究の動機と目的
- Evaluate zero-shot text classification using GPT, Llama2, GPT-3.5, and GPT-4 across multiple datasets.
- Compare zero-shot LLMs with traditional ML, DL, and ZSL methods under standardized prompts and inputs.
- Provide practical insights for small teams to deploy zero-shot text classification.
- Open-source the codebase to enable reproducibility and community impact.
提案手法
- Describe traditional text classification workflow and contrast with zero-shot prompting using GPT models.
- Standardize prompts and sampling parameters (temperature=0.01, top_p=0.9) across LLM experiments.
- Use raw test data for LLMs while traditional ML/DL models use standardized preprocessed input.
- Evaluate models on four datasets: COVID-19 tweets, economic text, e-commerce text, and SMS spam.
- Compare with MNB, LG, RF, DT, KNN, RNN, LSTM, GRU, BART, DeBERTa, and LLMs (Llama2, GPT-3.5, GPT-4).
- Report ACC, F1, and AUC metrics and analyze zero-shot performance.

実験結果
リサーチクエスチョン
- RQ1Can zero-shot LLMs achieve competitive text classification performance relative to traditional ML/DL methods?
- RQ2How do zero-shot LLMs compare to ZSL baselines on sentiment, topic, and spam tasks?
- RQ3What prompting and input-processing choices maximize zero-shot text classification performance?
- RQ4Are LLMs more advantageous for small teams or businesses with limited labeling capabilities?
- RQ5What are the limitations and biases observed in zero-shot LLM classifications across datasets?
主な発見
| Model | COVID-19 ACC | COVID-19 F1 | COVID-19 AUC | Economic ACC | Economic F1 | Economic AUC |
|---|---|---|---|---|---|---|
| MNB | 0.3933 | 0.3639 | 0.5531 | 0.4533 | 0.3632 | 0.5563 |
| LG | 0.4333 | 0.3488 | 0.5404 | 0.5200 | 0.3066 | 0.5427 |
| RF | 0.4467 | 0.3184 | 0.6184 | 0.5133 | 0.3453 | 0.5990 |
| DT | 0.4733 | 0.4105 | 0.5602 | 0.4067 | 0.3446 | 0.5060 |
| KNN | 0.3800 | 0.3486 | 0.5216 | 0.4800 | 0.3620 | 0.5614 |
| RNN | 0.7400 | 0.7186 | 0.8925 | 0.6333 | 0.5797 | 0.7874 |
| LSTM | 0.7867 | 0.7619 | 0.8925 | 0.6533 | 0.4627 | 0.7293 |
| GRU | 0.8200 | 0.8106 | 0.9226 | 0.6933 | 0.5767 | 0.7928 |
| BART | 0.5000 | 0.3516 | 0.5882 | 0.4600 | 0.4258 | 0.6603 |
| DeBERTa | 0.5467 | 0.3805 | 0.5954 | 0.4467 | 0.4251 | 0.6385 |
| Llama2 | 0.5267 | 0.4748 | - | 0.7000 | 0.5230 | - |
| GPT-3.5 | 0.5333 | 0.4943 | - | 0.6667 | 0.6683 | - |
| GPT-4 | 0.5267 | 0.5095 | - | 0.7133 | 0.7096 | - |
- LLMs achieve strong performance as zero-shot text classifiers in three out of four datasets.
- GPT-4 generally outperforms other baselines across datasets, especially in economic text and SMS tasks.
- Llama2 and GPT-3.5 show strengths in sentiment analysis and e-commerce classification, while LLMs lag behind in some COVID-19 sentiment cases.
- Traditional ML methods often underperform compared to DL and LLM-based zero-shot approaches in several tasks.
- Prompting strategy and data preprocessing (e.g., cleaning tweets) significantly affect model outputs and accuracy.
- Open-sourced code and standardized evaluation enable reproducibility and broader adoption.

より良い研究を、今すぐ始めましょう
論文設計から論文執筆まで、研究時間を劇的に削減しましょう。
クレジットカード登録不要
このレビューはAIが作成し、人間の編集者が確認しました。