QUICK REVIEW

[Paper Summary] FinBen: A Holistic Financial Benchmark for Large Language Models

Qianqian Xie, Weiguang Han|arXiv (Cornell University)|Feb 20, 2024
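FinTech, Crowdfunding, Digital Finance | Cited by 21
One-sentence summary

FinBen introduces a comprehensive open-source benchmark comprising 35 datasets across 23 financial tasks, organized into three spectrums inspired by Cattell-Horn-Carroll (CHC) theory, to evaluate LLMs on abilities ranging from inductive reasoning to trading.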

ABSTRACT

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks, covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and codes are released for the research community: https://github.com/The-FinAI/PIXIU.

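Motivation and objectives

  • Motivate the need for an LLM evaluation benchmark that covers broad, real-world financial settings.
  • Design FinBen to cover tasks including language understanding, knowledge extraction, numerical reasoning, generation, forecasting, and trading.
  • Provide an open-source framework spanning diverse data modalities to measure LLM capabilities in the financial domain.
  • Evaluate 15 representative LLMs to identify their strengths and limitations on financial tasks.
  • Propose CHC-inspired spectrums that map cognitive abilities in finance from basic skills to general intelligence.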

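Proposed method

  • Build FinBen from 35 datasets covering 23 financial tasks.
  • Organize the tasks into three spectrums that parallel CHC theory: Spectrum I (Quantification, Extraction, Numerical Understanding), Spectrum II (Generation, Forecasting), and Spectrum III (Stock Trading).
  • Evaluate the zero-shot and few-shot performance of 15 LLMs, including GPT-4, ChatGPT, Gemini, and open-source models.
  • Score each task with standard metrics (e.g., F1, accuracy, RMSE, ROUGE/BERTScore/BARTScore, MCC, and exact-match accuracy, EMAcc), plus trading metrics: cumulative return (CR), Sharpe ratio (SR), daily volatility (DV), annualized volatility (AV), and maximum drawdown (MD).
  • Compare performance across tasks to identify where instruction tuning helps and which gaps remain.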
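To make the classification and regression metrics listed above concrete, here is a minimal sketch of F1, MCC, and RMSE using their textbook definitions. The function names are illustrative and are not taken from the FinBen codebase. The sketch covers only the binary case, and the benchmark may use different averaging (for example macro or weighted F1), which this summary does not specify.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient; 0.0 by convention if undefined."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error for regression-style tasks."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

One reason a benchmark might report MCC alongside F1: a model that always predicts the positive class can still earn a sizable F1 on a balanced set, while its MCC is 0, because MCC also accounts for true negatives.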

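Experimental results

Research questions

  • RQ1: Can FinBen provide a broad, real-world evaluation in the financial domain that goes beyond existing NLP-centric benchmarks?
  • RQ2: On which financial tasks do current LLMs perform well, and on which do they struggle (e.g., complex extraction, numerical reasoning, forecasting)?
  • RQ3: How do different model families (GPT-4, Gemini, open-source LLMs) compare across the three CHC-inspired spectrums?
  • RQ4: Does instruction tuning improve performance equally on all tasks, or only on simpler ones?

Key findings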

Model | CR (%) ↑ | SR ↑ | DV (%) ↓ | AV (%) ↓ | MD (%) ↓
Buy and Hold | -4.83 ± 18.9 | 0.0541 ± 0.647 | 3.68 ± 1.18 | 58.3 ± 18.8 | 35.3 ± 13
GPT-4 | 28.3 ± 12.5 | 1.42 ± 0.575 | 2.78 ± 0.949 | 44.1 ± 15 | 18.5 ± 6.92
ChatGPT | 5.46 ± 15.5 | 0.139 ± 0.755 | 3.14 ± 1.16 | 49.9 ± 18.5 | 32.1 ± 10.3
LLaMA2-70B | 4.07 ± 20.2 | 0.486 ± 1.12 | 2.41 ± 0.873 | 38.2 ± 13.9 | 23.1 ± 11.9
Gemini | 21 ± 21.6 | 0.861 ± 0.832 | 2.5 ± 1.23 | 39.7 ± 19.6 | 22.5 ± 7.9

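  • GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading; Gemini stands out in generation and forecasting.
  • Instruction tuning improves performance on simple tasks but helps less with complex numerical reasoning, generation, and forecasting.
  • Open-source and Chinese fine-tuned models perform strongly on some classification tasks, but cross-lingual effects and dataset alignment influence the results.
  • The stock trading task probes LLMs' general-intelligence abilities; among the models evaluated, GPT-4 achieves the highest Sharpe ratio (1.42) and the lowest maximum drawdown (18.5%).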
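The trading columns in the table (CR, SR, DV, AV, MD) can be illustrated with a short sketch computed from a series of daily returns. This summary does not give the paper's exact formulas, so everything below is an assumption: simple compounded returns, a sample standard deviation, a zero risk-free rate by default, and 252 trading days per year. The annualization factor is at least consistent with the table, where AV divided by DV is close to sqrt(252) ≈ 15.87 (for example 58.3 / 3.68 ≈ 15.84). The name `trading_metrics` is hypothetical.

```python
import math

TRADING_DAYS = 252  # assumed annualization factor


def trading_metrics(daily_returns, risk_free_daily=0.0):
    """Compute CR, SR, DV, AV, MD as fractions (multiply by 100 for %)."""
    n = len(daily_returns)
    if n < 2:
        raise ValueError("need at least two daily returns")

    # Equity curve from compounding simple returns, starting at 1.0.
    equity, value = [], 1.0
    for r in daily_returns:
        value *= 1.0 + r
        equity.append(value)
    cumulative_return = equity[-1] - 1.0

    # Daily volatility: sample standard deviation of daily returns.
    mean = sum(daily_returns) / n
    variance = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    daily_vol = math.sqrt(variance)
    annual_vol = daily_vol * math.sqrt(TRADING_DAYS)

    # Annualized Sharpe ratio of excess daily returns.
    if daily_vol > 0:
        sharpe = (mean - risk_free_daily) / daily_vol * math.sqrt(TRADING_DAYS)
    else:
        sharpe = 0.0

    # Maximum drawdown: largest peak-to-trough decline of the equity curve.
    peak, max_drawdown = 1.0, 0.0
    for v in equity:
        peak = max(peak, v)
        max_drawdown = max(max_drawdown, (peak - v) / peak)

    return {
        "CR": cumulative_return,
        "SR": sharpe,
        "DV": daily_vol,
        "AV": annual_vol,
        "MD": max_drawdown,
    }
```

Under these definitions, higher CR and SR are better and lower DV, AV, and MD are better, matching the arrows in the table header.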

This summary was generated by AI and reviewed by a human editor.