QUICK REVIEW

[Paper Review] FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets

Neng Wang, Hongyang Yang|arXiv (Cornell University)|Oct 7, 2023

Topic Modeling | Citations: 8

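One-Line Summary

This paper presents an Instruction Tuning framework for open-source LLMs in the financial domain and benchmarks six base models (Llama2, Falcon, BLOOM, MPT, ChatGLM2, Qwen) on financial datasets under task-specific, multi-task, and zero-shot settings.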

ABSTRACT

In the swiftly expanding domain of Natural Language Processing (NLP), the potential of GPT-based models for the financial sector is increasingly evident. However, the integration of these models with financial datasets presents challenges, notably in determining their adeptness and relevance. This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models, specifically adapted for financial contexts. Through this methodology, we capitalize on the interoperability of open-source models, ensuring a seamless and transparent integration. We begin by explaining the Instruction Tuning paradigm, highlighting its effectiveness for immediate integration. The paper presents a benchmarking scheme designed for end-to-end training and testing, employing a cost-effective progression. Firstly, we assess basic competencies and fundamental tasks, such as Named Entity Recognition (NER) and sentiment analysis to enhance specialization. Next, we delve into a comprehensive model, executing multi-task operations by amalgamating all instructional tunings to examine versatility. Finally, we explore the zero-shot capabilities by earmarking unseen tasks and incorporating novel datasets to understand adaptability in uncharted terrains. Such a paradigm fortifies the principles of openness and reproducibility, laying a robust foundation for future investigations in open-source financial large language models (FinLLMs).

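Motivation and Objectives

Introduce an Instruction Tuning paradigm suited to open-source FinLLMs in the financial domain.
Provide a low-cost, end-to-end benchmarking method that progresses from basic tasks through multi-task to zero-shot evaluation.
Analyze a diverse set of open-source base models (Llama2, Falcon, BLOOM, MPT, ChatGLM2, Qwen) on financial tasks.
Promote openness and reproducibility by sharing the data preparation, instructions, and training workflow.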

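Proposed Method

Describes a three-stage Instruction Tuning paradigm: Task-Specific Tuning, Multi-Task Tuning, and Zero-Shot Tuning.
Uses task-specific instructions for sentiment analysis (SA), headline classification (HC), named entity recognition (NER), and relation extraction (RE); NER and RE are also converted into classification variants (NER(CLS), RE(CLS)) for consistency.
Applies LoRA (rank 8, alpha 32) to the six base models (Llama2-7B, Falcon-7B, BLOOM-7.1B, MPT-7B, ChatGLM2-6B, Qwen-7B).
Builds and standardizes prompts in an Instruction/Input/Answer format; zero-shot prompts add an Options section; ten unique prompts are used per task.
Trains in three stages (SA/HC/RE: 8 epochs; NER: 50 epochs; multi-task: 4 epochs; zero-shot: 1 epoch), validating with SA-based checkpoints.
Provides a cost estimate (GPU configuration, 90 hours in total, about $302.4) and uses FP16 training for efficiency.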
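As a rough illustration of the Instruction/Input/Answer prompt format with an optional Options section, a minimal Python sketch might look like the following. The function name `build_prompt`, the exact field labels and separators, the position of the Options line, and the example instruction text are all assumptions made for illustration; this review does not give the exact template.

```python
def build_prompt(instruction, text, options=None):
    """Assemble one prompt string in an Instruction/Input/Answer layout.

    Hypothetical helper: the labels, ordering, and newline separators
    are assumptions, not the template confirmed by the paper.
    """
    parts = [f"Instruction: {instruction}"]
    if options:
        # Zero-shot prompts list the allowed answers explicitly.
        parts.append("Options: " + ", ".join(options))
    parts.append(f"Input: {text}")
    # The model is expected to continue the text after "Answer: ".
    parts.append("Answer: ")
    return "\n".join(parts)


# Hypothetical sentiment-analysis example (not taken from the datasets).
prompt = build_prompt(
    "What is the sentiment of this news?",
    "Shares rose 5% after the earnings call.",
    options=["negative", "neutral", "positive"],
)
print(prompt)
```

Listing the options in the prompt constrains the answer space, which is one plausible way an Options section could help on unseen classification tasks.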

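Experimental Results

Research Questions

RQ1: After instruction tuning, how do different open-source FinLLMs perform on finance-specific tasks?
RQ2: How does multi-task instruction tuning affect performance on the SA, HC, NER, and RE tasks?
RQ3: Can zero-shot tuning generalize to unseen financial tasks, and how can prompts be restructured to reduce hallucination?
RQ4: Which base model best balances versatility and performance across both classification and information extraction tasks?
RQ5: What practical considerations (data, prompts, training cost) apply to open-source FinLLMs in real-world financial applications?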

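Key Findings

Llama2 achieves the best average ranking across all tasks (average ranking 2.0).
BLOOM excels at the information extraction tasks (NER, RE) but is weaker on the SA/HC classification tasks.
MPT is top on SA but lags on NER and RE; Qwen shows balanced versatility across tasks.
Multi-task tuning has mixed effects: it substantially improves RE for many models, while classification performance can drop for some models.
The zero-shot experiments show strong generalization for ChatGLM2 and Falcon, whereas BLOOM and Qwen struggle on certain classification tasks.
The zero-shot results suggest that task reformulation and adaptive training can mitigate hallucination and improve alignment with unseen tasks.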
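The review does not say how the "average ranking" figure is computed. One plausible reading is that models are ranked per task (1 = best) and the ranks are averaged across tasks. The sketch below shows that calculation under this assumption; the function name `average_rank`, the model and task names, and all scores are made-up placeholders, not results from the paper.

```python
from statistics import mean


def average_rank(scores):
    """Return {model: mean rank} from {task: {model: score}}.

    Assumes higher scores are better and rank 1 is best.
    Ties are not handled specially; tied models keep input order.
    """
    ranks = {}
    for by_model in scores.values():
        # Sort model names from best to worst score on this task.
        ordered = sorted(by_model, key=by_model.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: mean(positions) for model, positions in ranks.items()}


# Placeholder scores for illustration only (NOT from the paper).
toy_scores = {
    "task_a": {"model_x": 0.9, "model_y": 0.8, "model_z": 0.7},
    "task_b": {"model_x": 0.6, "model_y": 0.8, "model_z": 0.7},
}
print(average_rank(toy_scores))
```

A lower average rank means a model is consistently near the top across tasks, even if it is not the best on any single one.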

This review was generated by AI and checked by a human editor.