QUICK REVIEW

[Paper Review] Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models

Lochan Basyal, Mihir M. Sanghvi | arXiv (Cornell University) | Oct 16, 2023

Topic Modeling | Cited by 11

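One-Line Summary

The paper compares the text summarization performance of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI text-davinci-003 on the CNN/Daily Mail and XSum datasets using BLEU, ROUGE, and BERT scores, and concludes that text-davinci-003 is generally the strongest.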

ABSTRACT

Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Leveraging Large Language Models (LLMs) has shown remarkable promise in enhancing summarization techniques. This paper embarks on an exploration of text summarization with a diverse set of LLMs, including MPT-7b-instruct, falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models. The experiment was performed with different hyperparameters and evaluated the generated summaries using widely accepted metrics such as the Bilingual Evaluation Understudy (BLEU) Score, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score, and Bidirectional Encoder Representations from Transformers (BERT) Score. According to the experiment, text-davinci-003 outperformed the others. This investigation involved two distinct datasets: CNN Daily Mail and XSum. Its primary objective was to provide a comprehensive understanding of the performance of Large Language Models (LLMs) when applied to different datasets. The assessment of these models' effectiveness contributes valuable insights to researchers and practitioners within the NLP domain. This work serves as a resource for those interested in harnessing the potential of LLMs for text summarization and lays the foundation for the development of advanced Generative AI applications aimed at addressing a wide spectrum of business challenges.

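Motivation and Objectives

Evaluate the performance of different LLMs on abstractive summarization across two datasets (CNN/Daily Mail and XSum).
Analyze the influence of model size and instruction tuning (MPT-7b-instruct, Falcon-7b-instruct) on summary quality.
Quantify summary quality using BLEU, ROUGE, and BERT scores to guide model selection for practical NLP tasks.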

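Proposed Method

Compare the LLMs (MPT-7b-instruct, Falcon-7b-instruct, text-davinci-003) under uniform inference settings (temperature 0.1, max tokens 100).
Use LangChain and Hugging Face pipelines for prompt design and execution on an NVIDIA T4 GPU in a GCE VM.
Evaluate the generated summaries with BLEU, ROUGE (N, L), and BERT scores, and report the average word count per dataset.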
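
To make the ROUGE (N, L) evaluation step concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It is not the paper's evaluation code: this review does not say which ROUGE implementation the authors used, and the lowercasing, whitespace tokenization, lack of stemming, and all function names (`rouge_n`, `rouge_l`, and the helpers) are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 from clipped n-gram overlap.

    Assumption: lowercased whitespace tokens, no stemming.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print("ROUGE-1:", round(rouge_n(generated, reference, 1), 3))
    print("ROUGE-2:", round(rouge_n(generated, reference, 2), 3))
    print("ROUGE-L:", round(rouge_l(generated, reference), 3))
```

Established packages typically add stemming and more careful tokenization, so scores from this sketch would not be expected to match the paper's numbers exactly.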

Experimental Results

Research Questions

RQ1: Which LLM provides the highest ROUGE and BERT scores for CNN/Daily Mail and XSum summaries?
RQ2: How do 7B-instruct models compare to the OpenAI text-davinci-003 in abstractive summarization?
RQ3: What is the influence of dataset type (CNN vs XSum) on model performance across metrics?

Key Findings

Model	Dataset	Avg. word count	ROUGE-1	ROUGE-2	ROUGE-L	BERT score (P/R/F1)
falcon-7b-instruct	CNN (n=25)	784.24	0.226	0.053	0.197	0.818 / 0.860 / 0.838
falcon-7b-instruct	XSum (n=25)	410.44	0.139	0.014	0.113	0.787 / 0.863 / 0.823
mpt-7b-instruct	CNN (n=25)	784.24	0.236	0.060	0.213	0.839 / 0.864 / 0.851
mpt-7b-instruct	XSum (n=25)	410.44	0.159	0.024	0.133	0.828 / 0.871 / 0.848
text-davinci-003	CNN (n=25)	784.24	0.272	0.096	0.255	0.854 / 0.883 / 0.868
text-davinci-003	XSum (n=25)	410.44	0.206	0.053	0.173	0.844 / 0.893 / 0.868

text-davinci-003 achieves the highest ROUGE-1, ROUGE-2, ROUGE-L, and BERT scores on both datasets.
Among the 7B-instruct models, MPT-7b-instruct outperforms Falcon-7b-instruct on every reported ROUGE and BERT score.
The two samples differ in length: the average word count is 784.24 for CNN/Daily Mail and 410.44 for XSum.
All three models score lower on XSum than on CNN/Daily Mail for every ROUGE variant, while BERT F1 changes far less (text-davinci-003 stays at 0.868 on both).
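
The comparisons above can be re-derived from the table with a few lines of Python. This is only a sketch for readers who want to check the arithmetic: the scores are copied from the table, while the helper names `best_model` and `relative_drop` are hypothetical and not from the paper.

```python
# Scores copied from the results table: (ROUGE-1, ROUGE-2, ROUGE-L, BERT F1).
METRICS = ("ROUGE-1", "ROUGE-2", "ROUGE-L", "BERT F1")

SCORES = {
    ("falcon-7b-instruct", "CNN"): (0.226, 0.053, 0.197, 0.838),
    ("falcon-7b-instruct", "XSum"): (0.139, 0.014, 0.113, 0.823),
    ("mpt-7b-instruct", "CNN"): (0.236, 0.060, 0.213, 0.851),
    ("mpt-7b-instruct", "XSum"): (0.159, 0.024, 0.133, 0.848),
    ("text-davinci-003", "CNN"): (0.272, 0.096, 0.255, 0.868),
    ("text-davinci-003", "XSum"): (0.206, 0.053, 0.173, 0.868),
}


def best_model(dataset, metric):
    """Return the model with the highest score for one dataset and metric."""
    idx = METRICS.index(metric)
    by_model = {m: s[idx] for (m, d), s in SCORES.items() if d == dataset}
    return max(by_model, key=by_model.get)


def relative_drop(model, metric):
    """Fractional decrease in a metric when moving from CNN to XSum."""
    idx = METRICS.index(metric)
    cnn = SCORES[(model, "CNN")][idx]
    xsum = SCORES[(model, "XSum")][idx]
    return (cnn - xsum) / cnn


if __name__ == "__main__":
    for dataset in ("CNN", "XSum"):
        for metric in METRICS:
            print(f"{dataset:5s} {metric:8s} best: {best_model(dataset, metric)}")
    for model in ("falcon-7b-instruct", "mpt-7b-instruct", "text-davinci-003"):
        drop = relative_drop(model, "ROUGE-1")
        print(f"{model}: ROUGE-1 drops {drop:.1%} from CNN to XSum")
```

Note that each row is based on n=25 samples, so small differences between models should be read with caution.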

This review was generated by AI and checked by a human editor.