QUICK REVIEW

[Paper Review] Open, Closed, or Small Language Models for Text Classification?

Hao Yu, Zachary Yang | arXiv (Cornell University) | Aug 19, 2023

Topic Modeling | Cited by 16

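One-Sentence Summary

Smaller supervised models often match or exceed generative large language models; open-source models can rival closed-source ones after fine-tuning, while the largest closed-source models perform best on the hardest tasks.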

ABSTRACT

Recent advancements in large language models have demonstrated remarkable capabilities across various NLP tasks. But many questions remain, including whether open-source models match closed ones, why these models excel or struggle with certain tasks, and what types of practical procedures can improve performance. We address these questions in the context of classification by evaluating three classes of models using eight datasets across three distinct tasks: named entity recognition, political party prediction, and misinformation detection. While larger LLMs often lead to improved performance, open-source models can rival their closed-source counterparts by fine-tuning. Moreover, supervised smaller models, like RoBERTa, can achieve similar or even greater performance in many datasets compared to generative LLMs. On the other hand, closed models maintain an advantage in hard tasks that demand the most generalizability. This study underscores the importance of model selection based on task requirements.

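Motivation and Objectives

Assess whether open-source models can match closed-source LLMs on text classification tasks.
Evaluate three classes of models (open LLMs, closed LLMs, and RoBERTa) across multiple datasets and tasks.
Identify the prompting and fine-tuning strategies that affect performance and generalization.
Analyze the cost and energy implications of different model choices.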

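Proposed Method

Compare model types, Llama 2 (13B, 70B), GPT-3.5, GPT-4, and RoBERTa (123M, 354M), on three tasks.
Evaluate zero-shot, few-shot, and fine-tuned settings.
Fine-tune Llama 2 (70B) with LoRA on the combined dataset for named entity recognition.
Test two prompt styles (Serial and JSON) for the LLMs and analyze prompt sensitivity.
Measure performance with task-appropriate metrics (F1, accuracy, macro-F1).
Provide a cost and energy-consumption analysis for training and inference.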
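
The metrics named above can be illustrated with a minimal sketch. It assumes one label per example (as in the ideology and misinformation tasks); NER is usually scored at the entity level, which this sketch does not cover. The function names `macro_f1` and `accuracy` are illustrative and do not come from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # gold class t was missed
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["left", "left", "right", "right"]
    pred = ["left", "right", "right", "right"]
    print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro-F1 differs from accuracy mainly on imbalanced label sets, which is one plausible reason a study like this would report it for the harder classification datasets.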

Figure 1: The training loss curve for supervised finetuning with Llama2 70B Chat on the combined dataset.

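Experimental Results

Research Questions

RQ1: How do the open-source Llama 2 models compare with closed-source LLMs (GPT-3.5, GPT-4) and RoBERTa on NER, ideology prediction, and misinformation tasks?
RQ2: Which prompting, few-shot, and fine-tuning strategies maximize performance for each model class?
RQ3: Can open-source models become competitive again through fine-tuning, and do closed-source models retain an advantage on the hardest tasks?
RQ4: What are the relative cost and energy consumption of each model class in practical use?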

Key Findings

Task	Dataset	Llama 2 (13B)	Llama 2 (70B)	GPT-3.5	GPT-4	RoBERTa
NER	CoNLL 2003	57.8 ± 11.5	82.5 ± 5.6	79.8 ± 6.2	–	94.3 ± 3.5
NER	WNUT 2017	35.4 ± 4.7	55.3 ± 4.7	54.6 ± 3.0	65.1 ± 3.0	59.6 ± 3.3
NER	WikiNER-EN	51.3 ± 8.8	76.1 ± 3.6	77.4 ± 0.6	–	96.2 ± 0.1
Explicit Ideology	2020 Election	95.5 ± 1.1	96.3 ± 0.5	97.0 ± 0.8	97.6 ± 0.5	97.3 ± 0.6
Explicit Ideology	COVID-19	90.2 ± 0.9	92.5 ± 1.3	94.7 ± 0.8	95.1 ± 0.6	91.2 ± 0.2
Explicit Ideology	2021 Election	82.1 ± 1.6	85.2 ± 1.0	87.7 ± 1.3	89.4 ± 1.2	95.2 ± 0.7
Implicit Ideology	2020 Election	71.9 ± 1.9	77.2 ± 1.0	92.9 ± 0.5	–	93.0 ± 0.2
Implicit Ideology	COVID-19	44.6 ± 1.6	53.9 ± 1.5	65.9 ± 2.0	68.6 ± 1.9	70.0 ± 2.7
Implicit Ideology	2021 Election	48.8 ± 3.5	55.7 ± 3.3	75.4 ± 1.6	–	82.3 ± 1.1
Misinfo	LIAR	50.0 ± 1.3	49.1 ± 2.5	68.5 ± 3.0	66.3 ± 2.1	61.5 ± 2.1
Misinfo	CT-FAN-22	21.2 ± 3.2	25.4 ± 2.1	43.7 ± 1.9	42.0 ± 2.6	21.6 ± 2.0

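Small supervised models such as RoBERTa often match or outperform open-source LLMs and GPT-3.5, and in some cases come close to GPT-4.
Prompt engineering significantly affects LLM performance; JSON prompts improve GPT-3.5's few-shot results, while Serial prompts can help Llama 2 in the zero-shot setting.
Fine-tuned open-source Llama 2 (70B) can outperform GPT-3.5, but RoBERTa is often the better overall choice on cost, speed, and transparency.
The largest closed-source models still lead on the most challenging tasks that require broad generalization (for example, some CT-FAN-22 misinformation settings).
Fine-tuned open-source models offer environmental and cost advantages; RoBERTa shows the best energy efficiency and cost profile on most tasks.
RoBERTa reaches performance similar to or better than generative LLMs on several datasets, which highlights the value of discriminative, supervised approaches.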
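
As a rough check on these findings, the sketch below tallies which model has the highest mean score in each row of the results table. It is only a sketch under stated assumptions: the `RESULTS` layout and the helper names are illustrative, the ± spread is ignored (several gaps are smaller than that spread), and the COVID-19 row between the two Explicit Ideology election rows is treated as an Explicit Ideology row.

```python
from collections import Counter

MODELS = ["Llama 2 (13B)", "Llama 2 (70B)", "GPT-3.5", "GPT-4", "RoBERTa"]

# Mean scores copied from the results table; None marks cells reported as "–".
RESULTS = {
    ("NER", "CoNLL 2003"): [57.8, 82.5, 79.8, None, 94.3],
    ("NER", "WNUT 2017"): [35.4, 55.3, 54.6, 65.1, 59.6],
    ("NER", "WikiNER-EN"): [51.3, 76.1, 77.4, None, 96.2],
    ("Explicit Ideology", "2020 Election"): [95.5, 96.3, 97.0, 97.6, 97.3],
    # Assumption: this row belongs to Explicit Ideology.
    ("Explicit Ideology", "COVID-19"): [90.2, 92.5, 94.7, 95.1, 91.2],
    ("Explicit Ideology", "2021 Election"): [82.1, 85.2, 87.7, 89.4, 95.2],
    ("Implicit Ideology", "2020 Election"): [71.9, 77.2, 92.9, None, 93.0],
    ("Implicit Ideology", "COVID-19"): [44.6, 53.9, 65.9, 68.6, 70.0],
    ("Implicit Ideology", "2021 Election"): [48.8, 55.7, 75.4, None, 82.3],
    ("Misinfo", "LIAR"): [50.0, 49.1, 68.5, 66.3, 61.5],
    ("Misinfo", "CT-FAN-22"): [21.2, 25.4, 43.7, 42.0, 21.6],
}


def best_model(scores):
    """Return the model with the highest reported mean, skipping missing cells."""
    candidates = [(s, m) for s, m in zip(scores, MODELS) if s is not None]
    return max(candidates, key=lambda c: c[0])[1]


def wins_by_model(results):
    """Count, per model, the number of rows where it has the top mean."""
    return Counter(best_model(scores) for scores in results.values())


if __name__ == "__main__":
    for (task, dataset), scores in RESULTS.items():
        print(f"{task} / {dataset}: {best_model(scores)}")
    print(wins_by_model(RESULTS))
```

On the table as given, this tally puts RoBERTa on top in 6 of the 11 rows, GPT-4 in 3, and GPT-3.5 in 2 (both misinformation datasets). Because it compares means only, it should be read as a coarse summary rather than a significance test.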

This review was generated by AI and checked by human editors.