QUICK REVIEW

[Paper Review] FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

Xin Guo, Haotian Xia | arXiv (Cornell University) | Aug 19, 2023

Topic Modeling | Cited by 10

One-sentence summary

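FinEval is a multiple-choice benchmark for the Chinese financial domain. It covers four categories (Finance, Economy, Accounting, Certificate) with 4,661 questions in total, and evaluates a range of large language models under zero-shot and few-shot settings with answer-only and chain-of-thought prompts. GPT-4 reaches an accuracy close to 70%.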

ABSTRACT

Large language models have demonstrated outstanding performance in various natural language processing tasks, but their security capabilities in the financial domain have not been explored, and their performance on complex tasks like financial agent remains unknown. This paper presents FinEval, a benchmark designed to evaluate LLMs' financial domain knowledge and practical abilities. The dataset contains 8,351 questions categorized into four different key areas: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. Financial Academic Knowledge comprises 4,661 multiple-choice questions spanning 34 subjects such as finance and economics. Financial Industry Knowledge contains 1,434 questions covering practical scenarios like investment research. Financial Security Knowledge assesses models through 1,640 questions on topics like application security and cryptography. Financial Agent evaluates tool usage and complex reasoning with 616 questions. FinEval has multiple evaluation settings, including zero-shot, five-shot with chain-of-thought, and assesses model performance using objective and subjective criteria. Our results show that Claude 3.5-Sonnet achieves the highest weighted average score of 72.9 across all financial domain categories under zero-shot setting. Our work provides a comprehensive benchmark closely aligned with Chinese financial domain.

Research motivation and goals

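Evaluate, within a comprehensive benchmark, how well Chinese-capable language models handle general knowledge in the financial domain.
Cover four categories (Finance, Economy, Accounting, Certificate), with data drawn from mock exams and textbooks.
Evaluate models under multiple prompt settings (zero-shot, few-shot, answer-only, chain-of-thought).
Provide baselines and a public leaderboard to advance LLM development for the Chinese financial domain.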

Proposed method

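Build FinEval from 4,661 questions covering 34 subjects across four categories (Finance, Economy, Accounting, Certificate).
Use four prompt modes: zero-shot AO (answer-only), zero-shot CoT (chain-of-thought), five-shot AO, and five-shot CoT.
Convert all questions to a four-option format, and provide English translations where needed to improve readability.
Manually refine and curate the data, split it into development, validation, and test sets, and store it in a structured, LaTeX-friendly format.
Evaluate 27 Chinese-capable LLMs and report each model's accuracy under its best setting.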
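
The four prompt modes differ along two axes: how many solved exemplars precede the target question (zero or five) and whether the model is cued to answer directly or to reason first. The sketch below illustrates that structure only. The `MCQ` type, the `build_prompt` helper, and the English cue phrases are hypothetical stand-ins, not FinEval's actual templates.

```python
from dataclasses import dataclass
from typing import Sequence

LETTERS = "ABCD"  # every question is assumed to be in four-option format


@dataclass
class MCQ:
    question: str
    choices: Sequence[str]  # exactly four options, in A-D order
    answer: str = ""        # gold letter, only needed for exemplars
    rationale: str = ""     # worked explanation, only needed for CoT exemplars


def format_question(q: MCQ) -> str:
    """Render the question stem followed by lettered options."""
    if len(q.choices) != len(LETTERS):
        raise ValueError("expected exactly four choices")
    lines = [q.question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, q.choices)]
    return "\n".join(lines)


def build_prompt(target: MCQ, exemplars: Sequence[MCQ] = (), cot: bool = False) -> str:
    """Build an answer-only (cot=False) or chain-of-thought (cot=True) prompt.

    Pass no exemplars for zero-shot, or five solved exemplars for five-shot.
    """
    cue = "Let's think step by step." if cot else "Answer:"
    parts = []
    for ex in exemplars:
        if cot:
            shown = f"{cue} {ex.rationale} So the answer is {ex.answer}."
        else:
            shown = f"{cue} {ex.answer}"
        parts.append(f"{format_question(ex)}\n{shown}")
    # The target question ends with the bare cue so the model continues from it.
    parts.append(f"{format_question(target)}\n{cue}")
    return "\n\n".join(parts)
```

Under these assumptions, `build_prompt(q)` gives zero-shot AO, `build_prompt(q, cot=True)` gives zero-shot CoT, and passing five exemplars (for example, drawn from the development set) gives the two five-shot variants.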

Experimental results

Research questions

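RQ1: How do state-of-the-art Chinese and English LLMs perform on the domain-specific financial knowledge in FinEval?
RQ2: How do different prompt settings (AO vs. CoT; zero-shot vs. few-shot) affect performance on Chinese financial tasks?
RQ3: Across the four categories (Finance, Economy, Accounting, Certificate), which models, grouped by size and architecture, handle Chinese financial domain knowledge best?
RQ4: Does chain-of-thought prompting help on Chinese financial multiple-choice questions, and under what conditions?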

Main findings

Model	Size	Finance	Economy	Accounting	Certificate	Average
GPT-4	unknown	71.0	74.5	59.3	70.4	68.6
ChatGPT	175B	59.3	61.6	45.2	55.1	55.0
Qwen-7B	7B	54.5	54.4	50.3	55.8	53.8
Qwen-Chat-7B	7B	51.5	52.1	44.5	53.6	50.5
Baichuan-13B-Base	13B	52.6	50.2	43.4	53.5	50.1
Baichuan-13B-Chat	13B	51.6	51.1	41.7	52.8	49.4
ChatGLM2-6B	6B	46.5	46.4	44.5	51.5	47.4
InternLM-7B	7B	49.0	49.2	40.5	49.4	47.1
InternLM-Chat-7B	7B	48.4	49.1	40.8	49.5	47.0
LLaMA-2-Chat-70B	70B	47.1	46.7	41.5	45.7	45.2
Falcon-40B	40B	45.4	43.2	35.8	44.8	42.4
Baichuan-7B	7B	44.9	41.5	34.9	45.6	42.0
LLaMA-2-Chat-13B	13B	41.6	38.4	34.1	42.1	39.3
Ziya-LLaMA-13B-v1	13B	43.3	36.9	34.3	41.2	39.3
Bloomz-7b1-mt	7B	41.4	42.1	32.5	39.7	38.8
LLaMA-2-13B	13B	39.5	38.6	31.6	39.6	37.4
ChatGLM-6B	6B	38.8	36.2	33.8	39.1	37.2
Chinese-Llama-2-7B	7B	37.8	37.8	31.4	36.7	35.9
Chinese-Alpaca-Plus-7B	7B	30.5	33.4	32.7	38.5	34.0
moss-moon-003-sft	16B	35.6	34.3	28.7	35.6	33.7
LLaMA-2-Chat-7B	7B	35.6	31.8	31.9	34.0	33.5
LLaMA-2-7B	7B	34.9	36.4	31.4	31.6	33.4
AquilaChat-7B	7B	34.2	31.3	29.8	36.2	33.1
moss-moon-003-base	16B	32.2	33.1	29.2	30.7	31.2
Aquila-7B	7B	27.1	31.6	32.4	33.6	31.2
LLaMA-13B	13B	33.1	29.7	27.2	33.6	31.1
Falcon-7B	7B	28.5	28.2	27.5	27.4	27.9
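
A quick arithmetic check on the table: the Average column is not always the plain mean of the four category columns. The sketch below recomputes the unweighted mean for three rows copied from the table. One possible explanation, which this summary does not confirm, is that the reported average is weighted by the number of questions in each category.

```python
# (Finance, Economy, Accounting, Certificate) scores and the reported Average,
# copied from three rows of the table above.
ROWS = {
    "GPT-4": ((71.0, 74.5, 59.3, 70.4), 68.6),
    "ChatGPT": ((59.3, 61.6, 45.2, 55.1), 55.0),
    "Falcon-7B": ((28.5, 28.2, 27.5, 27.4), 27.9),
}


def plain_mean(scores):
    """Unweighted mean of the category scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def gap_to_reported(scores, reported):
    """Plain mean minus the reported average, rounded to one decimal."""
    return round(plain_mean(scores) - reported, 1)


if __name__ == "__main__":
    for model, (scores, reported) in ROWS.items():
        print(model, plain_mean(scores), reported, gap_to_reported(scores, reported))
```

For these rows the plain mean exceeds the reported average by 0.2 points (GPT-4) and 0.3 points (ChatGPT), and matches it for Falcon-7B.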

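GPT-4 has the highest average accuracy (68.6% overall) and reaches or exceeds 70% in three of the four categories (Finance, Economy, and Certificate).
Across the 27 models, GPT-4 leads in every category; ChatGPT ranks second on average, at 55.0%.
Chinese LLMs such as Qwen-7B, Qwen-Chat-7B, and Baichuan-13B-Base/Chat reach average accuracies of roughly 50% (49.4% to 53.8%), but their performance drops under chain-of-thought prompting.
Across models, average accuracy under chain-of-thought settings is usually lower than under answer-only settings, indicating that CoT is not universally beneficial on this task.
Within a model family, larger models usually perform better, although the size of the gain varies by category.
The FinEval results indicate that current models still have substantial room for improvement in the Chinese financial domain.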
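
Comparing answer-only and chain-of-thought accuracy requires mapping each model output, short or long, to a single choice letter before scoring. The sketch below shows one way to do that. The `extract_choice` and `accuracy` helpers and their regular expressions are illustrative assumptions, not the extraction rules used by FinEval.

```python
import re
from typing import Optional, Sequence

# "answer is (B)", "Answer: B", and similar. The word "answer" is matched
# case-insensitively, but the choice letter must be an uppercase A-D that is
# not the first letter of a longer word.
_ANSWER_CUE = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([ABCD])\)?(?![A-Za-z])")
# Fallback: any standalone uppercase letter A-D.
_STANDALONE = re.compile(r"\b([ABCD])\b")


def extract_choice(output: str) -> Optional[str]:
    """Return the predicted letter, or None if no choice can be found."""
    cued = _ANSWER_CUE.findall(output)
    if cued:
        # In a reasoning chain the final stated answer is the last cue.
        return cued[-1]
    letters = _STANDALONE.findall(output)
    return letters[0] if letters else None


def accuracy(outputs: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of outputs whose extracted letter equals the gold letter."""
    if len(outputs) != len(gold) or not gold:
        raise ValueError("outputs and gold must be non-empty and equally long")
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return 100.0 * correct / len(gold)
```

Outputs with no recoverable letter count as wrong here. That is a design choice worth keeping in mind when reading AO versus CoT gaps, since longer reasoning chains give a heuristic extractor more opportunities to miss the final answer.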

This review was generated by AI and reviewed by human editors.