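[Paper Explained] FinanceBench: A New Benchmark for Financial Question Answering
FinanceBench introduces an open-book benchmark of 10,231 questions covering 40 publicly traded companies to evaluate LLMs on financial question answering, revealing marked limitations of current models when they lack retrieval or long context.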
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer to serve as a minimum performance standard. We test 16 state of the art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.
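Research Motivation and Objectives
- Evaluate the capabilities and limits of state-of-the-art LLMs on financial question answering in an open-book setting with retrieval.
- Provide a robust, ecologically valid dataset (10,231 questions) spanning domain-relevant, novel, and metrics-generated queries.
- Analyze how retrieval, long context windows, and prompting strategies affect model performance on financial QA.
- Identify common failure modes (hallucinations, incorrect answers, refusals) to guide safer enterprise deployment.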
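Proposed Method
- Build FinanceBench from 40 publicly traded US companies and 361 filings (2015–2023), producing 10,231 question, answer, and evidence triples.
- Use a three-way question taxonomy: domain-relevant, novel generated, and metrics-generated questions; 14–18 base metrics are derived for each filing and used to generate derived-metric questions.
- Create a 150-case human evaluation set with expert annotations (correct / incorrect / refused), covering different configurations and prompt orders.
- Evaluate 16 model configurations (GPT-4, GPT-4-Turbo, Claude2, Llama2) across five settings (Closed Book, Oracle, Single Vector Store, Shared Vector Store, Long Context) and two prompt orders (Context-First, Context-Last).
- Label model answers for correctness, along with qualitative patterns such as high-quality evidence, valid alternative answers, hallucinations, refusals, and irrelevance.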
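The two prompt orders differ only in where the evidence sits relative to the question. The following is a minimal sketch of that idea, assuming a plain-text template; the function name `build_prompt` and the template wording are hypothetical and are not taken from the paper.

```python
def build_prompt(question: str, context: str, order: str = "context_first") -> str:
    """Assemble a prompt with the evidence placed before or after the question.

    The "Context:" / "Question:" wording is an illustrative assumption,
    not the prompt used in the FinanceBench evaluation.
    """
    context_block = f"Context:\n{context}"
    question_block = f"Question: {question}"

    if order == "context_first":
        parts = [context_block, question_block]
    elif order == "context_last":
        parts = [question_block, context_block]
    else:
        raise ValueError(f"unknown order: {order!r}")

    # Separate the two blocks with a blank line.
    return "\n\n".join(parts)


# Hypothetical question and evidence, for illustration only.
first = build_prompt("What was FY2022 revenue?", "Revenue was 120 million.")
last = build_prompt("What was FY2022 revenue?", "Revenue was 120 million.", "context_last")
```

With Context-First the question is the last thing the model reads before it answers; with Context-Last the question precedes a potentially very long document.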
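Experimental Results
Research Questions
- RQ1: How do current LLMs perform on open-book financial QA with retrieved evidence?
- RQ2: How do retrieval strategies (single vs. shared vector store) affect correctness and error types in financial QA?
- RQ3: How does long-context access affect performance on financial questions, especially metrics-generated queries?
- RQ4: How do prompt order and Oracle access affect model success rates across configurations?
- RQ5: What are the common failure modes (hallucinations, refusals) of LLMs in financial QA?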
Key Findings
| Model | Context configuration | Correct | Incorrect | Failed to answer | Total |
|---|---|---|---|---|---|
| GPT-4-Turbo | Closed Book | 14 (9%) | 5 (3%) | 126 (88%) | 150 |
| Llama2 | Shared Vector Store | 29 (19%) | 104 (70%) | 17 (11%) | 150 |
| GPT-4-Turbo | Shared Vector Store | 29 (19%) | 20 (13%) | 101 (68%) | 150 |
| Llama2 | Single Vector Store | 62 (41%) | 81 (54%) | 7 (5%) | 150 |
| GPT-4-Turbo | Single Vector Store | 75 (50%) | 17 (11%) | 58 (39%) | 150 |
| Claude2 | Long Context | 114 (76%) | 32 (21%) | 4 (3%) | 150 |
| GPT-4-Turbo | Long Context | 118 (79%) | 26 (17%) | 6 (4%) | 150 |
| GPT-4-Turbo | Oracle | 128 (85%) | 22 (15%) | 0 (0%) | 150 |
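The percentages in the table follow directly from the raw counts out of 150 cases. As a small worked sketch (the `Tally` class and its field names are illustrative, not from the paper), the GPT-4-Turbo rows can be recomputed like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Counts of labelled answers for one model configuration."""
    correct: int
    incorrect: int
    refused: int

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.refused

    def pct(self, count: int) -> int:
        # Share of all cases, rounded to the nearest whole percent.
        return round(100 * count / self.total)

    @property
    def failure_pct(self) -> int:
        # "Failure" here means either an incorrect answer or no answer.
        return self.pct(self.incorrect + self.refused)


# Counts copied from the GPT-4-Turbo rows of the table above.
shared_store = Tally(correct=29, incorrect=20, refused=101)
single_store = Tally(correct=75, incorrect=17, refused=58)
oracle = Tally(correct=128, incorrect=22, refused=0)
```

Here `shared_store.failure_pct` evaluates to 81, which is the 81% of questions that the abstract reports GPT-4-Turbo with a retrieval system answered incorrectly or refused to answer.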
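- Without retrieval or long context, models perform poorly on FinanceBench (e.g., GPT-4-Turbo in the Closed Book setting answers only 9% of questions correctly).
- Retrieval and long-context augmentation substantially improve performance: the Oracle setting reaches 85%, and Long Context reaches 76–79% depending on the model.
- A separate vector store per document generally outperforms a single shared vector store; per-document indexing yields higher accuracy (e.g., GPT-4-Turbo: 50% vs. 19%; Llama2: 41% vs. 19%).
- Context-First prompting markedly improves long-context performance for GPT-4-Turbo and Claude2 (e.g., 78% vs. 25% in the Long Context setting).
- Models still show weaknesses such as hallucinations and incorrect answers; refusing to answer may be safer, but it still signals limits for practical use.
- Performance varies by question type; metrics-generated questions are the most challenging because they involve numerical reasoning and cross-document retrieval.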
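One way to see why a per-document store can beat a shared store is that a shared index may return a well-matching passage from the wrong company's filing. The toy sketch below illustrates this under loud assumptions: relevance is plain token overlap standing in for embedding similarity, and the class `ToyStore`, the document IDs, and the passage texts are all made up for illustration. It is not the retrieval pipeline used in the paper.

```python
from collections import defaultdict


def tokenize(text: str) -> set:
    return set(text.lower().split())


def overlap(query: str, chunk: str) -> int:
    # Toy relevance score: number of shared tokens (a stand-in for
    # embedding similarity, which a real vector store would use).
    return len(tokenize(query) & tokenize(chunk))


class ToyStore:
    """A minimal in-memory store of (doc_id, text) chunks."""

    def __init__(self) -> None:
        self.chunks = []

    def add(self, doc_id: str, text: str) -> None:
        self.chunks.append((doc_id, text))

    def top_k(self, query: str, k: int = 1) -> list:
        ranked = sorted(self.chunks, key=lambda c: overlap(query, c[1]), reverse=True)
        return ranked[:k]


# Hypothetical filings and passages, invented for this example.
CORPUS = [
    ("ACME_2022_10K", "net sales for fiscal 2022 were 120 million"),
    ("ACME_2022_10K", "acme employs 900 people"),
    ("BETA_2022_10K", "total revenue in fiscal 2022 was 300 million"),
]

shared = ToyStore()                    # one index over every filing
per_document = defaultdict(ToyStore)   # one index per filing
for doc_id, text in CORPUS:
    shared.add(doc_id, text)
    per_document[doc_id].add(doc_id, text)

# A question that is meant to be answered from the ACME filing.
QUESTION = "what was total revenue in fiscal 2022"
shared_hit = shared.top_k(QUESTION)[0]
single_hit = per_document["ACME_2022_10K"].top_k(QUESTION)[0]
```

In this toy setup `shared_hit` comes from `BETA_2022_10K`, because its wording matches the question more closely, while `single_hit` comes from the intended `ACME_2022_10K` filing because the search is restricted to that document.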
This explainer was generated by AI and reviewed by a human editor.