QUICK REVIEW

[论文解读] Large language models for aspect-based sentiment analysis

Simmering, Paul F., Paavo Huoviala|arXiv (Cornell University)|Oct 27, 2023

Topic Modeling被引用 12

一句话总结

微调的 GPT-3.5 在 SemEval-2014 ABSA 联合任务上实现了 83.8 的 F1 的 state-of-the-art；与 GPT-4 和 InstructABSA 相比，成本-性能比具有优势；零-shot/少-shot 提示在某些设置下有帮助，但微调在很大程度上减少了对提示的需求。

ABSTRACT

Large language models (LLMs) offer unprecedented text completion capabilities. As general models, they can fulfill a wide range of roles, including those of more specialized models. We assess the performance of GPT-4 and GPT-3.5 in zero shot, few shot and fine-tuned settings on the aspect-based sentiment analysis (ABSA) task. Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task of the SemEval-2014 Task 4, improving upon InstructABSA [@scaria_instructabsa_2023] by 5.7%. However, this comes at the price of 1000 times more model parameters and thus increased inference cost. We discuss the the cost-performance trade-offs of different models, and analyze the typical errors that they make. Our results also indicate that detailed prompts improve performance in zero-shot and few-shot settings but are not necessary for fine-tuned models. This evidence is relevant for practioners that are faced with the choice of prompt engineering versus fine-tuning when using LLMs for ABSA.

研究动机与目标

评估 GPT-4 和 GPT-3.5 在零-shot、少-shot 和微调设置下，对 ABSA 的联合特征项提取和极性分类任务（ATE 与极性）在 SemEval-2014 上的表现。

提出的方法

在 SemEval-2014 ABSA 基准测试（针对笔记本和餐厅评测）上以 F1 为主要指标进行评估。
测试多种提示变体、上下文示例数量，以及通过 OpenAI API 对 GPT-3.5 进行微调。
使用 JSON 结构与 OpenAI 函数调用来标准化模型输出。
将 InstructABSA 作为强基线进行对比。
分析模型的错误类型和成本-性能权衡。
考察微调模型是否需要详细提示。

Figure 1: Overview of experimental setup

实验结果

研究问题

RQ1GPT-4、GPT-3.5（零-shot/少-shot）和微调后的 GPT-3.5 在 SemEval-2014 的联合 ATE 与极性任务上对 ABSA 的表现如何？
RQ2提示设计、上下文示例和微调对 ABSA 表现与成本的影响是什么？
RQ3微调后的 GPT-3.5 是否在 SemEval-2014 ABSA 联合任务上超越先前的最优结果（InstructABSA）？
RQ4LLM 在 ABSA 中表现出哪些错误模式，微调如何影响假阳性与假阴性？
RQ5使用提示与微调进行 ABSA 的经济权衡（成本 vs. F1）是什么？

主要发现

微调的 GPT-3.5 在 SemEval-2014 联合 ABSA 任务上达到 83.8 的 F1，超越 InstructABSA 5.7%。
GPT-3.5 与 GPT-4 具有不同的成本-性能特征，微调后的 GPT-3.5 在较低成本下提供强大表现。
详细提示在某些设置中提升了零-shot 与少-shot 的结果，但对微调模型不是必需的。
零-shot 的 GPT-4 可能不如 GPT-3.5，而上下文中的示例使 GPT-4 达到具有竞争力的水平。
错误分析显示，主要收益来自降低对非金标准的特征项的误报，以及在正确提取术语时改进极性分类。
微调的 GPT-3.5 将某些子类型的假阳性降低高达约 88%，并在特征项提取上实现更高的精度。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。