QUICK REVIEW

[Paper Review] Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks

Xianzhi Li, Chan, Samuel|arXiv (Cornell University)|May 10, 2023

Stock Market Forecasting Methods | Cited by 15

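One-Sentence Summary

The paper empirically evaluates ChatGPT and GPT-4 on eight financial NLP benchmarks spanning five task categories, comparing them with domain-specific models and fine-tuned baselines to assess their strengths and limitations in the financial domain.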

ABSTRACT

The most recent large language models (LLMs) such as ChatGPT and GPT-4 have shown exceptional capabilities of generalist models, achieving state-of-the-art performance on a wide range of NLP tasks with little or no adaptation. How effective are such models in the financial domain? Understanding this basic question would have a significant impact on many downstream financial analytical tasks. In this paper, we conduct an empirical study and provide experimental evidences of their performance on a wide variety of financial text analytical problems, using eight benchmark datasets from five categories of tasks. We report both the strengths and limitations of the current models by comparing them to the state-of-the-art fine-tuned approaches and the recently released domain-specific pretrained models. We hope our study can help understand the capability of the existing models in the financial domain and facilitate further improvements.

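Research Motivation and Objectives

Evaluate how effective general-purpose large language models (ChatGPT and GPT-4) are on financial text analytics tasks.
Compare their performance with domain-specific pretrained models and fine-tuned baselines.
Identify the strengths, limitations, and prompting strategies that affect financial NLP tasks.
Provide actionable guidance on when to use LLMs versus fine-tuned domain models in the financial domain.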

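Proposed Method

Uses gpt-3.5-turbo and GPT-4 (8k context; some FinQA experiments use GPT-4 16k) with zero-shot, few-shot, and chain-of-thought (CoT) prompting.
Evaluates on eight datasets across five task categories: sentiment analysis, classification, named entity recognition (NER), relation extraction (RE), and question answering (QA).
Compares against FinBert, FinQANet, BloombergGPT, and baselines such as CRF for NER and Luke-base for RE.
Applies standard evaluation metrics: accuracy, macro-F1 (including for NER), and entity-level F1 where applicable.
For question answering, analyzes the impact of few-shot and CoT prompting and compares against specialized FinQANet variants.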
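
The metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration of the conventional definitions of accuracy, macro-F1, and exact-match entity-level F1, not the paper's evaluation scripts, which this review does not describe. The function names and the span tuple layout `(doc_id, start, end, type)` are assumptions made for the example.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    labels = set(gold) | set(pred)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def entity_f1(gold_spans, pred_spans):
    """F1 over exact-match entity spans (assumed layout: doc_id, start, end, type)."""
    gold_set, pred_set = set(gold_spans), set(pred_spans)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Macro-F1 weights every class equally, so a model that ignores a rare class (for example a sparse "neutral" sentiment label) is penalized more heavily than it would be under plain accuracy.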

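Experimental Results

Research Questions

RQ1: Can ChatGPT and GPT-4 outperform domain-specific fine-tuned models on financial NLP benchmarks?
RQ2: How do prompting strategies (zero-shot, few-shot, chain-of-thought) affect performance on financial tasks?
RQ3: Which financial tasks (sentiment, classification, NER, RE, QA) are better suited to general-purpose LLMs, and where do domain models still dominate?
RQ4: What limitations do general-purpose LLMs have in structured prediction and numerical reasoning in the financial domain?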

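Key Findings

GPT-4 generally outperforms ChatGPT and many baselines on most tasks and datasets.
Few-shot prompting, and especially chain-of-thought prompting, substantially improves performance, in some question answering settings by 10–30 percentage points.
For NER and some structured prediction tasks, domain-specific and fine-tuned models (such as BloombergGPT, FinQANet, Luke-base) can still outperform general-purpose LLMs.
In question answering, GPT-4 often outperforms the other models, including some fine-tuned baselines, but still falls short of expert-level accuracy (about 90%).
General-purpose LLMs can outperform domain-specific models on several tasks, but the advantage is task-dependent and does not hold across all financial NLP challenges.
When applying LLMs to financial NLP, prompting strategies (few-shot, CoT) are recommended as the first approach to try.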
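
To make the three prompting strategies concrete, here is a rough sketch of how zero-shot, few-shot, and chain-of-thought prompts differ as plain text. The helper `build_prompt`, the `Input:`/`Answer:` layout, the example sentences, and the "Let's think step by step." cue are all assumptions chosen for illustration; the paper's actual prompt wording is not given in this review, and no model API is called here.

```python
def build_prompt(task_instruction, query, examples=None, chain_of_thought=False):
    """Assemble a zero-shot, few-shot, or chain-of-thought prompt as plain text.

    examples: optional list of (input_text, answer) demonstration pairs.
    chain_of_thought: if True, append a generic step-by-step reasoning cue.
    """
    parts = [task_instruction.strip()]

    # Few-shot: prepend worked demonstrations before the real query.
    for example_input, example_answer in (examples or []):
        parts.append(f"Input: {example_input}\nAnswer: {example_answer}")

    final = f"Input: {query}\nAnswer:"
    if chain_of_thought:
        # Hypothetical cue; any instruction that elicits intermediate reasoning fits here.
        final += " Let's think step by step."
    parts.append(final)

    return "\n\n".join(parts)


instruction = "Classify the sentiment of the financial sentence as positive, negative, or neutral."
query = "Shares fell 5% after the earnings call."

zero_shot = build_prompt(instruction, query)
few_shot = build_prompt(instruction, query, examples=[("Profit rose 12% year over year.", "positive")])
cot = build_prompt(instruction, query, chain_of_thought=True)
```

Keeping the prompt builder separate from the model call makes it straightforward to run the same dataset under each strategy and compare scores, which is the kind of comparison the findings above summarize.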


This review was generated by AI and reviewed by a human editor.