QUICK REVIEW

[Paper Review] Sentiment Analysis in the Era of Large Language Models: A Reality Check

Wenxuan Zhang, Yue Deng | arXiv (Cornell University) | May 24, 2023

Topic Modeling | Cited by 55

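One-Sentence Summary

This paper evaluates large language models (LLMs) on 13 sentiment analysis tasks across 26 datasets, compares zero-shot and few-shot LLMs against small models fine-tuned on in-domain data, and proposes SentiEval as a new benchmark for more realistic sentiment analysis evaluation.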

ABSTRACT

Sentiment analysis (SA) has been a long-standing research area in natural language processing. It can offer rich insights into human sentiments and opinions and has thus seen considerable interest from both academia and industry. With the advent of large language models (LLMs) such as ChatGPT, there is a great potential for their employment on SA problems. However, the extent to which existing LLMs can be leveraged for different sentiment analysis tasks remains unclear. This paper aims to provide a comprehensive investigation into the capabilities of LLMs in performing various sentiment analysis tasks, from conventional sentiment classification to aspect-based sentiment analysis and multifaceted analysis of subjective texts. We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets. Our study reveals that while LLMs demonstrate satisfactory performance in simpler tasks, they lag behind in more complex tasks requiring deeper understanding or structured sentiment information. However, LLMs significantly outperform SLMs in few-shot learning settings, suggesting their potential when annotation resources are limited. We also highlight the limitations of current evaluation practices in assessing LLMs' SA abilities and propose a novel benchmark, SentiEval, for a more comprehensive and realistic evaluation. Data and code during our investigations are available at https://github.com/DAMO-NLP-SG/LLM-Sentiment.

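Motivation and Objectives

Assess how LLMs perform across a broad range of sentiment analysis tasks, from simple binary classification to aspect-based sentiment analysis (ABSA) and multifaceted analysis of subjective texts (MAST).
Compare zero-shot and few-shot LLM performance against small language models fine-tuned on data from the same domain.
Critically examine current SA evaluation practices in the LLM era and propose a more comprehensive benchmark (SentiEval).
Release data and code to support reproducibility and further research on LLM-based SA.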

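Proposed Method

Systematically evaluates 13 SA tasks on 26 datasets, with each dataset's test set capped at 500 samples.
Compares open-source LLMs (Flan-T5 XXL, Flan-UL2) and the OpenAI GPT-3.5 series (ChatGPT, text-davinci-003) against a small language model (T5 large) trained on in-domain data.
Uses zero-shot and few-shot prompting with carefully designed prompts to keep conditions consistent across models; explores multiple prompts (including GPT-4-generated ones) to assess prompt sensitivity.
Analysis covers standard automatic metrics (e.g., accuracy, micro-F1, macro-F1) plus targeted human evaluation on fine-grained ABSA tasks.
Evaluates ABSA variants (UABSA, ASTE, ASQP) and MAST tasks (implicit sentiment, hate speech, irony, offensive language, stance, comparative opinions, emotion).
Studies prompt-design sensitivity and its effect on ABSA tasks; discusses RLHF-related biases observed in ChatGPT (e.g., on hate speech, irony, and offensive language).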
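
The test-set cap and the zero-shot versus few-shot prompting described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `cap_test_set` and `build_prompt`, the "Sentence:/Label:" template, the instruction wording, and the use of seeded random sampling to reach the 500-sample cap are all assumptions made for the example.

```python
import random


def cap_test_set(examples, limit=500, seed=0):
    """Return at most `limit` examples.

    Assumption: the cap is reached by reproducible random sampling;
    the summary only states that test sets are capped at 500 samples.
    """
    examples = list(examples)
    if len(examples) <= limit:
        return examples
    rng = random.Random(seed)  # local RNG so the global state is untouched
    return rng.sample(examples, limit)


def build_prompt(instruction, text, demos=()):
    """Assemble a prompt: zero-shot when `demos` is empty, few-shot otherwise.

    `demos` is a sequence of (text, label) pairs shown before the query.
    The template below is hypothetical, not the one used in the paper.
    """
    parts = [instruction.strip(), ""]
    for demo_text, demo_label in demos:
        parts.append(f"Sentence: {demo_text}")
        parts.append(f"Label: {demo_label}")
        parts.append("")
    parts.append(f"Sentence: {text}")
    parts.append("Label:")  # the model is expected to continue from here
    return "\n".join(parts)


# Hypothetical instruction and demonstrations, for illustration only.
instruction = "Classify the sentiment of the sentence as positive or negative."
zero_shot = build_prompt(instruction, "The food was great.")
few_shot = build_prompt(
    instruction,
    "The food was great.",
    demos=[("I loved it.", "positive"), ("Terrible service.", "negative")],
)
```

Keeping the instruction and template identical across models, and varying only the demonstrations, is one simple way to hold prompting conditions constant when comparing models.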

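Experimental Results

Research Questions

RQ1: How do large language models perform across a broad range of sentiment analysis tasks?
RQ2: In zero-shot and few-shot settings, do LLMs outperform small, domain-fine-tuned models on sentiment analysis tasks?
RQ3: Are current SA evaluation practices adequate for assessing LLM-based sentiment analysis, or is a more comprehensive benchmark needed?
RQ4: What are the limitations and pitfalls (e.g., prompt sensitivity, task structure) of applying LLMs to SA?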

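Key Findings

LLMs show satisfactory zero-shot performance on simple sentiment analysis tasks (such as binary sentiment classification) but still trail fine-tuned small models on complex or structured tasks (such as ABSA).
In the zero-shot setting, ChatGPT reaches roughly 97% of the fine-tuned T5 model's performance on sentiment classification (SC) tasks and about 83% on MAST tasks, showing strong inherent sentiment analysis ability, though a gap remains on structured output.
In few-shot settings, LLMs consistently outperform SLMs when annotations are scarce, although context length and prompt design can limit the gains.
RLHF-aligned models (such as ChatGPT) may underperform some larger, non-RLHF models on tasks such as hate speech, irony, and offensive language detection, which points to an alignment bias.
Prompt design has a marked effect on ABSA-style tasks, while SC tasks are comparatively insensitive to it; human evaluation generally rates LLM outputs on ABSA more favorably than automatic metrics do.
The authors introduce SentiEval as a benchmark that makes evaluation more comprehensive, covers multiple tasks, and reduces prompt-design bias in sentiment analysis testing.
Data and code for reproducibility are available in the authors' repository (https://github.com/DAMO-NLP-SG/LLM-Sentiment).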
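
Two of the findings above are easier to read with a small worked example: the "percentage of the fine-tuned model" comparison, and the gap between automatic metrics and human judgment on ABSA. The sketch below assumes that ABSA predictions are scored by exact match on (aspect, sentiment) tuples, which is a common convention but is not stated in this summary. The function names and all data and scores are made up for illustration and are not results from the paper.

```python
def tuple_f1(gold, pred):
    """Micro-F1 over per-sentence tuple sets, counting only exact matches.

    `gold` and `pred` are parallel lists; each item is a list of
    (aspect, sentiment) tuples for one sentence.
    """
    tp = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        tp += len(g_set & p_set)  # a tuple counts only if every field matches
        n_pred += len(p_set)
        n_gold += len(g_set)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_performance(llm_score, slm_score):
    """LLM score as a percentage of the fine-tuned small model's score."""
    return 100.0 * llm_score / slm_score


# Made-up example: the first prediction names "battery" instead of
# "battery life", so exact match treats it as entirely wrong.
gold = [[("battery life", "positive")], [("screen", "negative")]]
pred = [[("battery", "positive")], [("screen", "negative")]]
score = tuple_f1(gold, pred)

# Made-up scores, only to show how a relative percentage is computed.
ratio = relative_performance(45.0, 50.0)
```

Under this assumed scoring, a near-miss span receives no credit, whereas a human annotator might accept it. That is one plausible reason human evaluation could look more favorable than automatic metrics.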

This review was generated by AI and checked by human editors.