QUICK REVIEW

[Paper Review] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Yunseung Lee, Subin Kim|arXiv (Cornell University)|Feb 19, 2026

Topic Modeling|Cited by 0

One-Sentence Summary

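BankMathBench introduces a domain-specific numerical reasoning benchmark for everyday banking tasks across three difficulty levels, and shows that both standard and tool-augmented fine-tuning of open-source LLMs substantially improve accuracy on banking calculations.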

ABSTRACT

Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
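To make the computations named in the abstract concrete, the following minimal Python sketch contrasts a lump-sum deposit (an exponent) with an installment savings plan (a geometric progression). The function names, the monthly compounding convention, the start-of-month payment timing, and the sample figures are all illustrative assumptions rather than details from the paper; real products differ in compounding rules and tax treatment.

```python
def deposit_payout(principal: float, annual_rate: float, months: int) -> float:
    """Maturity value of a lump-sum deposit, assuming monthly compounding."""
    monthly_rate = annual_rate / 12
    return principal * (1 + monthly_rate) ** months


def installment_payout(monthly_payment: float, annual_rate: float, months: int) -> float:
    """Maturity value of equal payments made at the start of each month.

    The payment made in month k compounds for (months - k + 1) months, so the
    total is a geometric series: P * (1 + r) * ((1 + r)^n - 1) / r.
    """
    r = annual_rate / 12
    if r == 0:
        # No interest: the payout is simply the sum of the payments.
        return monthly_payment * months
    return monthly_payment * (1 + r) * ((1 + r) ** months - 1) / r


if __name__ == "__main__":
    # Hypothetical comparison: the same total amount, paid at once or monthly.
    lump = deposit_payout(12_000_000, 0.036, 12)
    inst = installment_payout(1_000_000, 0.036, 12)
    print(f"deposit payout:     {lump:,.0f}")
    print(f"installment payout: {inst:,.0f}")
```

Under these assumptions the lump-sum deposit pays out more at the same nominal rate, because every unit of principal earns interest for the full term. Mistaking one product type for the other is the kind of error the abstract describes.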

Research Motivation and Objectives

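Demonstrate the need for a domain-specific numerical reasoning benchmark in everyday banking contexts.
Create a multi-level dataset covering basic, intermediate, and advanced banking calculation tasks.
Enable evaluation and improvement of LLMs' ability to generate correct formulas and carry out multi-step calculations in banking scenarios.
Show how fine-tuning and tool augmentation affect numerical reasoning performance across languages and model sizes.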

Proposed Method

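An automated data generation pipeline using GPT-4o and o1-mini that creates question, answer, and reasoning triples at three difficulty levels.
Automated solving with dual verification: math/LaTeX at the basic level, executable Python at the intermediate and advanced levels.
Reasoning data generation with step-by-step natural-language reasoning, <think>…</think> annotations, and <calc>/<result> tags that separate reasoning from computation.
Expert validation by banking professionals to ensure practical relevance and numerical correctness.
Fine-tuning of open-source LLMs on BankMathBench (4-bit LoRA), using both a standard and a tool-augmented approach.
Tool augmentation that calls an external calculator on <calc>…</calc> blocks to verify computations and obtain exact results.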
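The tool-augmentation step can be pictured with a small Python sketch: find each <calc>…</calc> block in the model output, evaluate the arithmetic outside the model, and attach the value in a <result> tag. This is only an illustration under assumptions. The paper's actual calculator, tag layout, and rounding rules are not specified in this summary, and the names safe_eval and fill_results are hypothetical.

```python
import ast
import operator
import re

# Arithmetic operators the calculator accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

# Assumed tag layout: the expression sits between <calc> and </calc>.
CALC_PATTERN = re.compile(r"<calc>(.*?)</calc>", re.DOTALL)


def safe_eval(expression: str) -> float:
    """Evaluate a pure arithmetic expression by walking its AST (no eval())."""

    def visit(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](visit(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return visit(ast.parse(expression.strip(), mode="eval"))


def fill_results(model_output: str, digits: int = 2) -> str:
    """Append a <result> tag holding the computed value after every <calc> block."""

    def replace(match: re.Match) -> str:
        value = round(safe_eval(match.group(1)), digits)
        return f"{match.group(0)}<result>{value}</result>"

    return CALC_PATTERN.sub(replace, model_output)


if __name__ == "__main__":
    text = "<think>Simple interest for one year.</think><calc>1000000 * 0.03</calc>"
    print(fill_results(text))
```

Walking the syntax tree rather than calling eval() means that model-generated text can only trigger arithmetic. Function calls, names, and attribute access raise ValueError.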

Figure 1: Examples of frequently asked customer queries in real banking branches.

Experimental Results

Research Questions

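RQ1: How well do current LLMs perform numerical reasoning in realistic banking scenarios across basic, intermediate, and advanced tasks?
RQ2: How does domain-specific fine-tuning affect formula generation and numerical accuracy in banking calculations?
RQ3: Does tool-augmented fine-tuning further improve multi-step banking calculations and the integration of multiple computed results?
RQ4: How do language and model size affect performance on banking numerical reasoning tasks?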

Key Findings

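Across languages, zero-shot accuracy declines as task difficulty increases.
Fine-tuning on BankMathBench substantially improves performance, most notably for models such as Qwen3-8B and DeepSeek-Math-Instruct-7B in multilingual settings.
Tool-augmented fine-tuning yields substantial gains over SFT on intermediate and advanced tasks, particularly on the Korean dataset.
Fine-tuning substantially reduces the median absolute error; on the advanced dataset, the error with tool augmentation approaches zero.
Korean-specialized models (such as the Kanana series) perform well on Korean data, while larger multilingual models show broader gains in multilingual settings.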
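The findings above are stated in terms of accuracy, percentage-point (%p) gains, and median absolute error. The Python sketch below shows one plausible way to compute these quantities. The relative-tolerance matching rule and the function names are assumptions, since this summary does not state the paper's exact scoring criterion, and the numbers in the usage example are invented purely for illustration.

```python
from statistics import median


def accuracy(predictions, references, rel_tol=1e-2):
    """Percentage of predictions within a relative tolerance of the reference.

    The 1% tolerance is an assumed matching rule, not the paper's criterion.
    """
    hits = sum(
        abs(p - r) <= rel_tol * abs(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


def median_absolute_error(predictions, references):
    """Median of |prediction - reference| over all items."""
    return median(abs(p - r) for p, r in zip(predictions, references))


def gain_in_percentage_points(baseline_acc, tuned_acc):
    """Accuracy difference in percentage points (%p), not a relative change."""
    return tuned_acc - baseline_acc


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    refs = [100.0, 200.0, 400.0, 800.0]
    zero_shot = [100.0, 150.0, 300.0, 1000.0]
    tuned = [100.0, 200.0, 401.0, 800.0]
    base_acc = accuracy(zero_shot, refs)
    tuned_acc = accuracy(tuned, refs)
    print("gain (%p):", gain_in_percentage_points(base_acc, tuned_acc))
    print("median abs. error:", median_absolute_error(tuned, refs))
```

A %p gain is an absolute difference between two accuracies. Moving from 20% to 80% accuracy is a gain of 60%p, even though the relative increase is 300%.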

Figure 2: Overview of the BankMathBench data generation pipeline, which comprises three stages: (a) question generation, (b) solution generation and automatic verification, and (c) reasoning generation.


This review was generated by AI and reviewed by a human editor.