QUICK REVIEW

[Paper Review] Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis

Md. Arid Hasan, Shudipta Das | arXiv (Cornell University) | Aug 21, 2023

Sentiment Analysis and Opinion Mining | Cited by 25

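One-Sentence Summary

This paper builds a large Bangla sentiment dataset (MUBASE) and compares zero-shot and few-shot prompting of large language models (LLMs) against fine-tuned models, finding that fine-tuned monolingual Bangla models generally outperform the LLMs on this task.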

ABSTRACT

The rapid expansion of the digital world has propelled sentiment analysis into a critical tool across diverse sectors such as marketing, politics, customer service, and healthcare. While there have been significant advancements in sentiment analysis for widely spoken languages, low-resource languages, such as Bangla, remain largely under-researched due to resource constraints. Furthermore, the recent unprecedented performance of Large Language Models (LLMs) in various applications highlights the need to evaluate them in the context of low-resource languages. In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments. We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz, offering a comparative analysis against fine-tuned models. Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios. To foster continued exploration, we intend to make this dataset and our research tools publicly available to the broader research community.

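Motivation and Objectives

Create one of the largest Bangla sentiment datasets (MUBASE) from social media.
Evaluate zero-shot and few-shot prompting of LLMs (Flan-T5, GPT-4, Bloomz) relative to fine-tuned models.
Analyze how prompt design and model type affect Bangla sentiment classification performance.
Assess whether monolingual Bangla models outperform multilingual or LLM-based approaches in low-resource Bangla sentiment analysis.
Plan a public release of the dataset and tools to support further research.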

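Proposed Method

Assemble and annotate a Bangla sentiment dataset (MUBASE) from news tweets and Facebook comments, totaling 33,606 entries after cleaning.
Fine-tune models including BanglaBERT, mBERT, XLM-RoBERTa, and Bloomz on the Bangla data.
Extract embeddings with GPT and train a feed-forward classifier on them as an embedding-based baseline.
Evaluate zero-shot and few-shot prompting of LLMs (Flan-T5, Bloomz, GPT-4) using carefully designed Bangla-English prompts and native Bangla prompts.
Prompt GPT-4 and Bloomz in 0-shot and 3-/5-shot settings, with few-shot examples selected by maximal marginal relevance (MMR); apply a majority-vote ensemble to improve Bloomz outputs.
Compare against random and majority-class baselines, and report accuracy and weighted precision, recall, and F1 on a stratified train/dev/test split (70/10/20).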
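The MMR-based selection of few-shot examples mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function names (`cosine`, `mmr_select`, `build_prompt`), the trade-off weight `lam`, the toy 2-D vectors, the placeholder texts, and the prompt wording are all assumptions made for illustration. In practice the vectors would come from a sentence-embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mmr_select(query_vec, cand_vecs, k, lam=0.5):
    """Pick k candidate indices by maximal marginal relevance.

    Each step takes the candidate that maximizes
    lam * sim(query, cand) - (1 - lam) * max sim(cand, already selected).
    `lam` is a hypothetical trade-off weight, not a value from the paper.
    """
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, cand_vecs[i])
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def build_prompt(instruction, shots, text):
    """Assemble a few-shot prompt; this template is illustrative only."""
    lines = [instruction, ""]
    for shot_text, shot_label in shots:
        lines += [f"Text: {shot_text}", f"Sentiment: {shot_label}", ""]
    lines += [f"Text: {text}", "Sentiment:"]
    return "\n".join(lines)

# Toy data: index 1 is a near-duplicate of index 0, index 2 is more diverse.
pool = [("<Bangla text 1>", "Positive"),
        ("<Bangla text 2>", "Positive"),
        ("<Bangla text 3>", "Negative")]
pool_vecs = [[1.0, 0.9], [1.0, 0.85], [0.8, 1.0]]
query_vec = [1.0, 1.0]

picked = mmr_select(query_vec, pool_vecs, k=2, lam=0.5)  # -> [0, 2]
prompt = build_prompt(
    "Classify the sentiment of the text as Positive, Neutral, or Negative.",
    [pool[i] for i in picked],
    "<Bangla test text>",
)
```

With `lam=1.0` the selection reduces to plain nearest-neighbour retrieval and would return the near-duplicate pair; lowering `lam` trades some relevance for diversity among the in-context examples.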

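Experimental Results

Research Questions

RQ1: How do zero-shot and few-shot LLM prompting perform on Bangla sentiment analysis compared with fine-tuned models?
RQ2: Do monolingual Bangla models (e.g., BanglaBERT) outperform multilingual or LLM-based approaches on Bangla sentiment tasks?
RQ3: How do prompt design and model size affect zero-shot/few-shot Bangla sentiment classification?
RQ4: Does ensembling predictions across models improve the performance of LLM-based approaches?
RQ5: Are native-language prompts as effective as English prompts for Bangla sentiment analysis?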

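Key Findings

Fine-tuned models consistently outperform zero-shot and few-shot LLM prompting across all settings.
Fine-tuning the monolingual BanglaBERT yields the best results among the models tested.
GPT-4 is competitive in the zero-shot setting but does not surpass fine-tuned monolingual models.
Bloomz occasionally outperforms GPT-4 in zero-shot/few-shot settings but struggles to predict the neutral class, while GPT-4 has difficulty with the positive class.
A majority-vote ensemble over Bloomz settings improves weighted F1 by 5.73 percentage points.
Augmenting the training data by combining MUBASE with SentNoB and fine-tuning BanglaBERT yields a further F1 gain of about 1.41%.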
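To make the ensemble and metric in these findings concrete, here is a minimal sketch of majority voting over several prediction runs, scored with support-weighted F1. It is an illustration under assumptions, not the paper's implementation: the function names, the tie-breaking rule (the label seen first wins), and the toy labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_run):
    """Combine label lists (one per prompt setting) by per-item majority vote.

    Ties go to the label that appears first among the runs; this
    tie-breaking rule is an assumption, not taken from the paper.
    """
    voted = []
    for labels in zip(*predictions_per_run):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

def weighted_f1(gold, pred):
    """Per-class F1 averaged with weights proportional to class support."""
    total = len(gold)
    score = 0.0
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * (tp + fn) / total  # weight by class support
    return score

# Toy example: three runs that err on different items.
gold = ["Positive", "Neutral", "Negative", "Negative"]
run_a = ["Positive", "Negative", "Negative", "Positive"]
run_b = ["Positive", "Neutral", "Negative", "Negative"]
run_c = ["Negative", "Neutral", "Positive", "Negative"]

ensemble = majority_vote([run_a, run_b, run_c])
single_score = weighted_f1(gold, run_a)
ensemble_score = weighted_f1(gold, ensemble)
```

In this toy case the runs make mistakes on different items, so the vote recovers every gold label. Real gains depend on how correlated the errors of the individual settings are.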


This review was generated by AI and reviewed by a human editor.