QUICK REVIEW

[Paper Review] FinBen: A Holistic Financial Benchmark for Large Language Models

Qianqian Xie, Weiguang Han|arXiv (Cornell University)|Feb 20, 2024

FinTech, Crowdfunding, Digital Finance|Cited by 21

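One-Sentence Summary

FinBen introduces a comprehensive open-source benchmark comprising 35 datasets across 23 financial tasks, organized into three spectrums inspired by the Cattell-Horn-Carroll (CHC) theory, to evaluate LLMs on abilities ranging from inductive reasoning to trading.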

ABSTRACT

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks, covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and codes are released for the research community: https://github.com/The-FinAI/PIXIU.

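Research Motivation and Objectives

Motivate the need for a benchmark that evaluates LLMs across a broad range of real-world financial settings.
Design FinBen to cover tasks spanning language understanding, knowledge extraction, numerical reasoning, generation, forecasting, and trading.
Provide an open-source framework covering diverse data modalities to measure LLM capabilities in finance.
Evaluate 15 representative LLMs to identify their strengths and limitations on financial tasks.
Propose CHC-inspired spectrums that map cognitive abilities in the financial domain from basic to general intelligence.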

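Proposed Method

Construct FinBen from 35 datasets covering 23 financial tasks.
Organize the tasks into three spectrums that echo CHC theory: Spectrum I (Quantification, Extraction, Numerical Understanding), Spectrum II (Generation, Forecasting), and Spectrum III (Stock Trading).
Evaluate the zero-shot and few-shot performance of 15 LLMs, including GPT-4, ChatGPT, Gemini, and open-source models.
Score each task with standard metrics (such as F1, accuracy, RMSE, ROUGE/BERTScore/BARTScore, MCC, and EMAcc) and score trading with CR (cumulative return), SR (Sharpe ratio), DV (daily volatility), AV (annualized volatility), and MD (maximum drawdown).
Compare performance across tasks to identify where instruction tuning helps and which gaps remain.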
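
Two of the classification metrics named above, F1 and MCC, can be illustrated with a short sketch. This is a minimal binary-label version written for this review, not the benchmark's own evaluation code; the function names `confusion_counts`, `f1_score`, and `mcc` are hypothetical, and FinBen's actual scoring (for example micro or weighted F1 on multi-class tasks) may differ.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative labels only (not taken from any FinBen dataset).
counts = confusion_counts([1, 1, 0, 0], [1, 0, 0, 0])
print(f1_score(*counts), mcc(*counts))
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over F1 on imbalanced labels.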

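Experimental Results

Research Questions

RQ1: Can FinBen provide a broad, real-world evaluation in finance that goes beyond existing NLP-centric benchmarks?
RQ2: On which financial tasks do current LLMs perform well, and on which do they struggle (for example, complex extraction, numerical reasoning, and forecasting)?
RQ3: How do different model families (GPT-4, Gemini, open-source LLMs) compare across the three CHC-inspired spectrums?
RQ4: Does instruction tuning improve performance equally on all tasks, or only on simpler ones?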

Key Findings

Model	CR (%) ↑	SR ↑	DV (%) ↓	AV (%) ↓	MD (%) ↓
Buy and Hold	-4.83 ± 18.9	0.0541 ± 0.647	3.68 ± 1.18	58.3 ± 18.8	35.3 ± 13
GPT-4	28.3 ± 12.5	1.42 ± 0.575	2.78 ± 0.949	44.1 ± 15	18.5 ± 6.92
ChatGPT	5.46 ± 15.5	0.139 ± 0.755	3.14 ± 1.16	49.9 ± 18.5	32.1 ± 10.3
LLaMA2-70B	4.07 ± 20.2	0.486 ± 1.12	2.41 ± 0.873	38.2 ± 13.9	23.1 ± 11.9
Gemini	21 ± 21.6	0.861 ± 0.832	2.5 ± 1.23	39.7 ± 19.6	22.5 ± 7.9
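
The trading columns in the table above (CR, SR, DV, AV, MD) can be derived from a series of daily strategy returns. The sketch below shows one common way to compute them. It assumes simple (not log) daily returns, a zero risk-free rate, and 252 trading days per year; the paper's exact formulas may differ, and `trading_metrics` is a hypothetical helper name. The AV and DV columns in the table are consistent with AV ≈ DV × √252 (for example, 2.78 × √252 ≈ 44.1 for GPT-4).

```python
import math
import statistics

TRADING_DAYS = 252  # assumption: trading days used for annualization


def trading_metrics(daily_returns, risk_free_daily=0.0):
    """Compute CR, SR, DV, AV, and MD from simple daily returns (as fractions)."""
    # Build an equity curve that starts at 1.0 and compounds each daily return.
    equity = [1.0]
    for r in daily_returns:
        equity.append(equity[-1] * (1.0 + r))

    cumulative_return = equity[-1] - 1.0
    daily_vol = statistics.stdev(daily_returns)       # sample standard deviation
    annual_vol = daily_vol * math.sqrt(TRADING_DAYS)

    # Annualized Sharpe ratio: mean excess daily return over daily volatility.
    mean_excess = statistics.mean(daily_returns) - risk_free_daily
    sharpe = (mean_excess / daily_vol) * math.sqrt(TRADING_DAYS) if daily_vol else 0.0

    # Maximum drawdown: largest peak-to-trough drop relative to the running peak.
    peak, max_drawdown = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, (peak - value) / peak)

    return {"CR": cumulative_return, "SR": sharpe, "DV": daily_vol,
            "AV": annual_vol, "MD": max_drawdown}


# Illustrative returns only (not data from the paper).
print(trading_metrics([0.10, -0.50, 0.20]))
```

Multiply the fractional outputs by 100 to compare them with the percentage columns in the table.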

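GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading; Gemini stands out in generation and forecasting.
Instruction tuning improves performance on simple tasks but is less effective for complex numerical reasoning, generation, and forecasting.
Open-source and Chinese fine-tuned models perform strongly on some classification tasks, but cross-lingual effects and dataset alignment influence the results.
The stock trading task probes the general-intelligence capabilities of LLMs; among the evaluated models, GPT-4 achieves the highest Sharpe ratio and the lowest maximum drawdown.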
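
The trading claim can be checked against the mean values in the table with a few lines of Python. This sketch copies only the means for CR, SR, and MD and ignores the ± spreads, which are large for several models, so it is a rough consistency check rather than a significance test. `RESULTS` and `best_model` are names introduced here for illustration.

```python
# Mean values copied from the trading table in this review (± spreads omitted).
RESULTS = {
    "Buy and Hold": {"CR": -4.83, "SR": 0.0541, "MD": 35.3},
    "GPT-4":        {"CR": 28.3,  "SR": 1.42,   "MD": 18.5},
    "ChatGPT":      {"CR": 5.46,  "SR": 0.139,  "MD": 32.1},
    "LLaMA2-70B":   {"CR": 4.07,  "SR": 0.486,  "MD": 23.1},
    "Gemini":       {"CR": 21.0,  "SR": 0.861,  "MD": 22.5},
}


def best_model(metric, higher_is_better):
    """Return the model name with the best mean value for the given metric."""
    pick = max if higher_is_better else min
    return pick(RESULTS, key=lambda name: RESULTS[name][metric])


print(best_model("SR", higher_is_better=True))   # highest Sharpe ratio
print(best_model("MD", higher_is_better=False))  # lowest maximum drawdown
```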


This review was generated by AI and reviewed by human editors.