QUICK REVIEW

[Paper Review] Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models

Boyu Zhang, Hongyang Yang|arXiv (Cornell University)|Oct 6, 2023

Stock Market Forecasting Methods | 8 citations

TL;DR

The paper presents a retrieval-augmented, instruction-tuned LLM framework for financial sentiment analysis that leverages external knowledge retrieval to improve accuracy and F1 scores, outperforming baselines and general-purpose LLMs.

ABSTRACT

Financial sentiment analysis is critical for valuation and investment decision-making. Traditional NLP models, however, are limited by their parameter size and the scope of their training datasets, which hampers their generalization capabilities and effectiveness in this field. Recently, Large Language Models (LLMs) pre-trained on extensive corpora have demonstrated superior performance across various NLP tasks due to their commendable zero-shot abilities. Yet, directly applying LLMs to financial sentiment analysis presents challenges: The discrepancy between the pre-training objective of LLMs and predicting the sentiment label can compromise their predictive performance. Furthermore, the succinct nature of financial news, often devoid of sufficient context, can significantly diminish the reliability of LLMs' sentiment analysis. To address these challenges, we introduce a retrieval-augmented LLMs framework for financial sentiment analysis. This framework includes an instruction-tuned LLMs module, which ensures LLMs behave as predictors of sentiment labels, and a retrieval-augmentation module which retrieves additional context from reliable external sources. Benchmarked against traditional models and LLMs like ChatGPT and LLaMA, our approach achieves 15% to 48% performance gain in accuracy and F1 score.

Motivation & Objective

Address limitations of traditional NLP and generic LLMs in financial sentiment analysis due to limited context and misaligned training objectives.
Propose a retrieval-augmented LLM framework combining instruction tuning and external knowledge retrieval.
Demonstrate performance gains on established financial sentiment benchmarks.
Show that retrieval-augmented generation (RAG) improves predictions for short financial texts such as news headlines and tweets.

Proposed method

Construct an instruction-following dataset for financial sentiment analysis by formatting existing datasets with multiple human-written instructions.
Fine-tune open-source LLMs (e.g., Llama-7B) using a causal language modeling objective to predict sentiment labels.
Map generated outputs to predefined sentiment classes (negative/neutral/positive).
Implement a Retrieval-Augmented Generation module that retrieves context from external sources (Bloomberg, Reuters, Goldman Sachs, Seeking Alpha, Twitter, Reddit) via multi-source querying and similarity-based filtering.
Use a two-step retrieval: 1) Multi-Source Knowledge Query, 2) Similarity-Based Retrieval using an overlap coefficient (Szymkiewicz-Simpson) with threshold >0.8 to select relevant context.
Evaluate with accuracy and F1-score on FPB, Twitter Val, and additional datasets; compare against FinBERT, BloombergGPT, Llama-7B, ChatGLM2-6B, and ChatGPT 4.0.
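
The prompt formatting, label mapping, and similarity-based filtering steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the prompt template, the regex tokenizer, the keyword-based label mapping, and every function name are hypothetical. Only the three sentiment classes, the overlap coefficient |A ∩ B| / min(|A|, |B|), and the 0.8 threshold come from the summary above.

```python
import re
from typing import List, Optional, Set

LABELS = ("negative", "neutral", "positive")


def build_prompt(instruction: str, text: str,
                 context: Optional[List[str]] = None) -> str:
    """Format one example as an instruction prompt (hypothetical template)."""
    parts = [f"Instruction: {instruction}", f"Input: {text}"]
    if context:
        # Retrieved passages are appended as extra context when available.
        parts.append("Context: " + " ".join(context))
    parts.append("Answer:")
    return "\n".join(parts)


def map_to_label(generated: str, default: str = "neutral") -> str:
    """Map free-form model output to one of the predefined classes.

    Assumption: pick the label that appears earliest in the output and
    fall back to `default` when no label is mentioned.
    """
    lowered = generated.lower()
    hits = [(lowered.find(label), label) for label in LABELS if label in lowered]
    return min(hits)[1] if hits else default


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_coefficient(a: str, b: str) -> float:
    """Szymkiewicz-Simpson overlap: |A & B| / min(|A|, |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def filter_context(query: str, candidates: List[str],
                   threshold: float = 0.8) -> List[str]:
    """Keep candidates whose overlap with the query exceeds the threshold."""
    return [c for c in candidates if overlap_coefficient(query, c) > threshold]
```

Because the overlap coefficient divides by the smaller token set, a short headline that is fully contained in a longer article scores 1.0, which suits matching terse news or tweets against longer source documents.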

Experimental results

Research questions

RQ1: Can instruction tuning align LLM behavior to predicting financial sentiment labels more effectively than standard pretraining objectives?
RQ2: Does retrieval-augmented generation provide significant gains by supplying external financial context for concise inputs like news headlines and tweets?
RQ3: What is the comparative performance of the proposed framework against state-of-the-art financial sentiment models and general-purpose LLMs?
RQ4: How does adding RAG affect sentiment predictions on benchmark datasets (FPB, Twitter Val) and case studies?

Key findings

Model	FPB Acc	FPB F1	Twitter Val Acc	Twitter Val F1
FinBERT	-	-	0.725	0.668
BloombergGPT	-	-	0.510	-
ChatGLM2-6B	0.474	0.402	0.482	0.381
Llama-7B	0.601	0.397	0.544	0.363
ChatGPT 4.0	0.643	0.511	0.788	0.652
Ours	0.758	0.739	0.863	0.811
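
For readers who want to see how the Acc and F1 columns in a table like the one above are typically computed for a three-class task, here is a small pure-Python sketch. The review does not say which F1 averaging the paper uses, so macro averaging here is an assumption, and the function names are illustrative.

```python
from typing import Sequence

LABELS = ("negative", "neutral", "positive")


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of examples whose predicted label equals the gold label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted examples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging weights the three classes equally, so a model that ignores a minority class is penalized more in F1 than in accuracy; that is one plausible reason the F1 values of some baselines sit well below their accuracy.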

The instruction-tuned Llama-7B ("Ours") outperforms every listed baseline on both FPB and Twitter Val, in accuracy and in F1.
Without RAG, the proposed method reaches 0.758 Acc / 0.739 F1 on FPB and 0.863 Acc / 0.811 F1 on Twitter Val (Table I).
Without RAG, ChatGPT 4.0 attains 0.643 Acc / 0.511 F1 on FPB and 0.788 Acc / 0.652 F1 on Twitter Val (Table I); with RAG, it reaches 0.813 Acc / 0.708 F1 on Twitter Val (Table II).
With RAG, the proposed method reaches 0.881 Acc / 0.842 F1 on Twitter Val (Table II), improving on its own no-RAG scores and staying ahead of ChatGPT 4.0 with RAG.
A case study shows that retrieved context lets the model correctly label an otherwise ambiguous statement as positive (Table III).
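
The scores reported above can be turned into relative gains with a short calculation. The sketch below uses ChatGPT 4.0 as the reference and defines gain as (ours - baseline) / baseline; both choices are assumptions made for illustration. It does not attempt to reproduce the 15% to 48% range quoted in the abstract, which may rest on a different baseline or definition.

```python
# (accuracy, F1) pairs copied from the key findings above.
NO_RAG = {
    "FPB": {"ChatGPT 4.0": (0.643, 0.511), "Ours": (0.758, 0.739)},
    "Twitter Val": {"ChatGPT 4.0": (0.788, 0.652), "Ours": (0.863, 0.811)},
}
WITH_RAG = {
    "Twitter Val": {"ChatGPT 4.0": (0.813, 0.708), "Ours": (0.881, 0.842)},
}


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement in percent: 100 * (ours - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (ours - baseline) / baseline


def report(title: str, table: dict) -> None:
    """Print accuracy and F1 gains of 'Ours' over ChatGPT 4.0 per dataset."""
    for dataset, scores in table.items():
        base_acc, base_f1 = scores["ChatGPT 4.0"]
        our_acc, our_f1 = scores["Ours"]
        print(f"{title} | {dataset}: "
              f"Acc {relative_gain(our_acc, base_acc):+.1f}%, "
              f"F1 {relative_gain(our_f1, base_f1):+.1f}%")


report("without RAG", NO_RAG)
report("with RAG", WITH_RAG)
```

Under this definition the F1 gains come out larger than the accuracy gains on every dataset, which matches the pattern visible in the raw scores.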

This review was created by AI and reviewed by human editors.