QUICK REVIEW

[Paper Review] Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models

Duanyu Feng, Y. S. Dai|arXiv (Cornell University)|Oct 1, 2023

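Financial Distress and Bankruptcy Prediction | Cited by 12

One-Sentence Summary

This paper proposes CALM, an instruction-tuned large language model framework for generalist credit scoring across a range of online financial tasks. It is built on a benchmark of 9 datasets and emphasizes bias analysis and open-source resources.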

ABSTRACT

In the financial industry, credit scoring is a fundamental element, shaping access to credit and determining the terms of loans for individuals and businesses alike. Traditional credit scoring methods, however, often grapple with challenges such as narrow knowledge scope and isolated evaluation of credit tasks. Our work posits that Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks. To systematically explore LLMs for credit scoring, we propose the first open-source comprehensive framework. We curate a novel benchmark covering 9 datasets with 14K samples, tailored for credit assessment and a critical examination of potential biases within LLMs, and the novel instruction tuning data with over 45k samples. We then propose the first Credit and Risk Assessment Large Language Model (CALM) by instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks. We evaluate CALM, existing state-of-the-art (SOTA) methods, open source and closed source LLMs on the built benchmark. Our empirical results illuminate the capability of LLMs to not only match but surpass conventional models, pointing towards a future where credit scoring can be more inclusive, comprehensive, and unbiased. We contribute to the industry's transformation by sharing our pioneering instruction-tuning datasets, credit and risk assessment LLM, and benchmarks with the research community and the financial industry.

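Research Motivation and Objectives

Demonstrate that LLMs can generalize across diverse online credit and risk tasks, beyond the reach of specialized single-task systems.
Create and release a comprehensive benchmark of 9 datasets (~14K samples) covering credit and risk assessment.
Develop CALM, an instruction-tuned LLM for credit and risk tasks, trained on a large-scale instruction-tuning corpus.
Examine the potential biases of LLMs when applied to credit scoring and risk assessment, and raise ethical considerations.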

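Proposed Method

Build a diverse tabular-data benchmark of 14K samples covering credit scoring, fraud detection, financial distress, and claim analysis.
Assemble a 45K-sample instruction-tuning dataset (6 datasets, balanced via resampling), with prompts in both table form and description form.
Fine-tune the LLaMa2-chat model with LoRA for 5 epochs, using AdamW with a learning rate of 3e-4, weight decay of 1e-5, and a maximum input length of 2048.
Evaluate CALM against SOTA expert systems and a range of open-source and closed-source LLMs (e.g., GPT-4, ChatGPT, Bloomz, Vicuna, and the Llama family) on accuracy, F1, MCC, and bias metrics.
Following AI Fairness 360, analyze data bias (Disparate Impact) and model bias (Equal Opportunity Difference, Average Odds Difference).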
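
The data-preparation steps above (table-form and description-form prompts, plus balancing by resampling) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt wording, the function names, the example field names, and the choice of oversampling smaller classes are hypothetical and are not taken from the paper, which may use different templates and a different resampling scheme.

```python
import random
from collections import defaultdict
from typing import Dict, List


def table_prompt(record: Dict[str, object], question: str) -> str:
    """Table form: one header row of feature names, one row of values.

    The layout is a hypothetical template, not the paper's exact prompt.
    """
    header = " | ".join(record.keys())
    values = " | ".join(str(v) for v in record.values())
    return f"{question}\n{header}\n{values}\nAnswer:"


def description_prompt(record: Dict[str, object], question: str) -> str:
    """Description form: features written out as a natural-language sentence.

    Again a hypothetical template chosen only to contrast with the table form.
    """
    facts = ", ".join(f"the {k} is {v}" for k, v in record.items())
    sentence = facts[:1].upper() + facts[1:]
    return f"{question}\nText: {sentence}.\nAnswer:"


def balance_by_resampling(rows: List[dict], label_key: str, seed: int = 0) -> List[dict]:
    """Oversample every class (with replacement) up to the largest class size.

    Assumption: the source only says the data is balanced via resampling;
    oversampling is one common way to do that.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced: List[dict] = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


if __name__ == "__main__":
    # Hypothetical applicant record and question.
    applicant = {"age": 35, "credit amount": 5000}
    question = "Is the credit risk good or bad?"
    print(table_prompt(applicant, question))
    print(description_prompt(applicant, question))
```

Keeping the two serializers separate makes it easy to generate both prompt forms from the same tabular record and compare how a model responds to each.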

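Experimental Results

Research Questions

RQ1 (H1): Can LLMs, through broad pretraining, generalize across diverse online tasks beyond the narrow expertise of traditional credit and risk systems?
RQ2 (H2): Can an instruction-tuned LLM, by fine-tuning on financial data, generalize or adapt across multiple related credit tasks?
RQ3 (H3): Do advances in LLM capability introduce or amplify fairness biases in credit decisions?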

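Key Findings

LLMs, GPT-4 in particular, can match or even exceed some traditional models on several credit and risk tasks.
CALM (the fine-tuned LLM) shows knowledge transfer across multiple credit and risk tasks and improves performance on datasets it was not trained on.
Observable biases with respect to sensitive attributes remain, underscoring the need for ethical oversight at deployment.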
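
To make the bias measurements concrete, the sketch below computes the three metrics named in the method (Disparate Impact, Equal Opportunity Difference, Average Odds Difference) using their standard definitions, with differences taken as unprivileged minus privileged. It is a plain-Python illustration, not the AI Fairness 360 API and not the authors' evaluation code; the function names and the toy inputs are assumptions.

```python
from typing import List, Sequence, Tuple


def _positive_rate(values: Sequence[int]) -> float:
    """Fraction of entries equal to 1 (0.0 for an empty group)."""
    return sum(values) / len(values) if values else 0.0


def disparate_impact(outcomes: Sequence[int], privileged: Sequence[int]) -> float:
    """P(outcome=1 | unprivileged) / P(outcome=1 | privileged).

    Raises ZeroDivisionError if the privileged group has no positive outcomes.
    """
    unpriv = [y for y, g in zip(outcomes, privileged) if not g]
    priv = [y for y, g in zip(outcomes, privileged) if g]
    return _positive_rate(unpriv) / _positive_rate(priv)


def _group_rates(y_true: Sequence[int], y_pred: Sequence[int],
                 privileged: Sequence[int], want_privileged: bool) -> Tuple[float, float]:
    """Return (TPR, FPR) for either the privileged or the unprivileged group."""
    pairs: List[Tuple[int, int]] = [
        (t, p) for t, p, g in zip(y_true, y_pred, privileged)
        if bool(g) == want_privileged
    ]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equal_opportunity_difference(y_true, y_pred, privileged) -> float:
    """TPR(unprivileged) - TPR(privileged); 0 means equal opportunity."""
    tpr_u, _ = _group_rates(y_true, y_pred, privileged, False)
    tpr_p, _ = _group_rates(y_true, y_pred, privileged, True)
    return tpr_u - tpr_p


def average_odds_difference(y_true, y_pred, privileged) -> float:
    """Mean of the FPR gap and the TPR gap between the two groups."""
    tpr_u, fpr_u = _group_rates(y_true, y_pred, privileged, False)
    tpr_p, fpr_p = _group_rates(y_true, y_pred, privileged, True)
    return 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))


if __name__ == "__main__":
    # Toy data: the first four rows are unprivileged, the last four privileged.
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
    group = [0, 0, 0, 0, 1, 1, 1, 1]
    print(disparate_impact(y_true, group))                      # data bias
    print(equal_opportunity_difference(y_true, y_pred, group))  # model bias
    print(average_odds_difference(y_true, y_pred, group))       # model bias
```

A Disparate Impact near 1 and differences near 0 indicate parity between groups; values far from those points are the kind of observable bias the findings refer to.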

This review was generated by AI and reviewed by a human editor.