QUICK REVIEW

[Paper Review] Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models

Duanyu Feng, Y. S. Dai|arXiv (Cornell University)|Oct 1, 2023

Financial Distress and Bankruptcy Prediction | Cited by 12

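One-Sentence Summary

This paper introduces CALM, an instruction-tuned LLM framework for generalist credit scoring across multiple online financial tasks, with an emphasis on a 9-dataset benchmark, bias analysis, and open resources.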

ABSTRACT

In the financial industry, credit scoring is a fundamental element, shaping access to credit and determining the terms of loans for individuals and businesses alike. Traditional credit scoring methods, however, often grapple with challenges such as narrow knowledge scope and isolated evaluation of credit tasks. Our work posits that Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks. To systematically explore LLMs for credit scoring, we propose the first open-source comprehensive framework. We curate a novel benchmark covering 9 datasets with 14K samples, tailored for credit assessment and a critical examination of potential biases within LLMs, and the novel instruction tuning data with over 45k samples. We then propose the first Credit and Risk Assessment Large Language Model (CALM) by instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks. We evaluate CALM, existing state-of-art (SOTA) methods, open source and closed source LLMs on the build benchmark. Our empirical results illuminate the capability of LLMs to not only match but surpass conventional models, pointing towards a future where credit scoring can be more inclusive, comprehensive, and unbiased. We contribute to the industry's transformation by sharing our pioneering instruction-tuning datasets, credit and risk assessment LLM, and benchmarks with the research community and the financial industry.

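Motivation and Objectives

Demonstrate that LLMs can generalize beyond conventional single-task expert systems to diverse online credit and risk tasks.
Build and release a comprehensive benchmark of 9 datasets (about 14K samples) for credit and risk assessment.
Develop CALM, an instruction-tuned LLM trained on a large instruction-tuning corpus specialized for credit and risk tasks.
Investigate potential biases of LLMs when applied to credit scoring and risk assessment, and propose ethical considerations.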

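Proposed Method

Build a diverse tabular benchmark of 14K samples covering credit scoring, fraud detection, financial distress identification, and claim analysis.
Assemble a 45K-sample instruction-tuning dataset (6 datasets, balanced through resampling) using table-based and description-based prompts.
Fine-tune the LLaMa2-chat model with LoRA for 5 epochs, using AdamW, a learning rate of 3e-4, a weight decay of 1e-5, and a maximum input length of 2048.
Evaluate CALM against SOTA expert systems and several open-source and closed-source LLMs (e.g., GPT-4, ChatGPT, Bloomz, Vicuna, the Llama family) on accuracy, F1, MCC, and bias metrics.
Following AI Fairness 360, analyze data bias (Disparate Impact) and model bias (Equal Opportunity Difference, Average Odds Difference).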
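
The three performance metrics named in the method list (accuracy, F1, MCC) can all be derived from binary confusion counts. The following is a minimal sketch, not the paper's evaluation code: it assumes labels encoded as 1 (positive, e.g. default or fraud) and 0 (negative), computes F1 on the positive class only, and uses hypothetical function names. The paper may average F1 differently.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall on the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical imbalanced example: a model that always predicts 0.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [0] * 10
    counts = confusion_counts(y_true, y_pred)
    print(accuracy(*counts), f1_score(*counts), mcc(*counts))
```

In the hypothetical example, a model that always predicts the majority class still reaches high accuracy while F1 and MCC are both zero, which is one reason to report MCC alongside accuracy on skewed credit and fraud data.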

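Experimental Results

Research Questions

RQ1 / H1: Can LLMs leverage their broad pretraining to overcome the narrow expertise of conventional credit and risk systems?
RQ2 / H2: Can instruction-tuned LLMs generalize and adapt across multiple related credit tasks through fine-tuning on financial data?
RQ3 / H3: Do advances in LLM capability introduce or amplify fairness bias in credit decisions?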

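Key Findings

LLMs, GPT-4 in particular, can match or outperform conventional models on several credit and risk tasks.
CALM (the fine-tuned LLM) transfers knowledge across multiple credit and risk tasks and improves performance on datasets it was not trained on.
Bias with respect to sensitive attributes is still observed in LLMs, underscoring the need for ethical oversight at deployment.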
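
The bias finding rests on the group-fairness metrics the review names: Disparate Impact for data bias, and Equal Opportunity Difference and Average Odds Difference for model bias. The sketch below writes out their standard definitions in plain Python rather than calling the AI Fairness 360 library. It assumes 1 is the favorable outcome and that `privileged` is a 0/1 flag per row; all function names are hypothetical and this is not the paper's code.

```python
def _rate(values):
    """Mean of a list of 0/1 values; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def disparate_impact(labels, privileged):
    """Favorable-label rate of the unprivileged group divided by that of
    the privileged group. A value of 1.0 indicates parity."""
    unpriv = _rate([y for y, g in zip(labels, privileged) if g == 0])
    priv = _rate([y for y, g in zip(labels, privileged) if g == 1])
    return unpriv / priv if priv else float("inf")


def group_rates(y_true, y_pred, privileged, group):
    """Return (TPR, FPR) restricted to rows whose flag equals `group`."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, privileged) if g == group]
    tpr = _rate([p for t, p in rows if t == 1])
    fpr = _rate([p for t, p in rows if t == 0])
    return tpr, fpr


def equal_opportunity_difference(y_true, y_pred, privileged):
    """TPR(unprivileged) - TPR(privileged). Zero indicates parity."""
    tpr_u, _ = group_rates(y_true, y_pred, privileged, 0)
    tpr_p, _ = group_rates(y_true, y_pred, privileged, 1)
    return tpr_u - tpr_p


def average_odds_difference(y_true, y_pred, privileged):
    """Mean of the FPR gap and the TPR gap (unprivileged - privileged)."""
    tpr_u, fpr_u = group_rates(y_true, y_pred, privileged, 0)
    tpr_p, fpr_p = group_rates(y_true, y_pred, privileged, 1)
    return 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))


if __name__ == "__main__":
    # Hypothetical toy data: first four rows privileged, last four not.
    privileged = [1, 1, 1, 1, 0, 0, 0, 0]
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
    print(disparate_impact(y_true, privileged))
    print(equal_opportunity_difference(y_true, y_pred, privileged))
    print(average_odds_difference(y_true, y_pred, privileged))
```

Negative values of the two difference metrics mean the model favors the privileged group. Disparate Impact is computed on labels only, so it describes the dataset rather than any model's predictions.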

This review was generated by AI and checked by a human editor.