QUICK REVIEW

[Paper Review] Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models

Lochan Basyal, Mihir M. Sanghvi|arXiv (Cornell University)|Oct 16, 2023
Topic Modeling|11 citations
TL;DR

The paper compares the summarization performance of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI's text-davinci-003 on the CNN/Daily Mail and XSum datasets using BLEU, ROUGE, and BERTScore, and finds text-davinci-003 the strongest overall.

ABSTRACT

Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Leveraging Large Language Models (LLMs) has shown remarkable promise in enhancing summarization techniques. This paper embarks on an exploration of text summarization with a diverse set of LLMs, including MPT-7b-instruct, falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models. The experiment was performed with different hyperparameters and evaluated the generated summaries using widely accepted metrics such as the Bilingual Evaluation Understudy (BLEU) Score, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score, and Bidirectional Encoder Representations from Transformers (BERT) Score. According to the experiment, text-davinci-003 outperformed the others. This investigation involved two distinct datasets: CNN Daily Mail and XSum. Its primary objective was to provide a comprehensive understanding of the performance of Large Language Models (LLMs) when applied to different datasets. The assessment of these models' effectiveness contributes valuable insights to researchers and practitioners within the NLP domain. This work serves as a resource for those interested in harnessing the potential of LLMs for text summarization and lays the foundation for the development of advanced Generative AI applications aimed at addressing a wide spectrum of business challenges.

Motivation & Objective

  • Evaluate how different LLMs perform on abstractive summarization tasks across two datasets (CNN/Daily Mail and XSum).
  • Analyze the impact of model size and instruction-tuning (MPT-7b-instruct, Falcon-7b-instruct) on summary quality.
  • Quantify summarization quality using BLEU, ROUGE, and BERT scores to guide model choice for practical NLP tasks.

Proposed method

  • Compare LLMs (MPT-7b-instruct, Falcon-7b-instruct, text-davinci-003) under a uniform inference setup (temperature 0.1, max tokens 100).
  • Use LangChain and Hugging Face pipelines for prompt engineering and execution on a GCE VM with NVIDIA T4 GPUs.
  • Evaluate generated summaries with BLEU, ROUGE (N, L), and BERT Score; report average word counts per dataset.
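
To make the ROUGE part of the evaluation concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It is not the authors' evaluation code, which the review does not show. The function names (`rouge_n`, `rouge_l`, and the helpers) are illustrative, and the sketch assumes lowercased whitespace tokenization with no stemming, so its values may differ from those of standard ROUGE packages.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Turn an overlap count into an F1 score (harmonic mean of P and R)."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    # Assumption: simple lowercased whitespace tokenization, no stemming.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Toy sentences for illustration only; not taken from either dataset.
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print("ROUGE-1:", round(rouge_n(candidate, reference, 1), 3))
    print("ROUGE-2:", round(rouge_n(candidate, reference, 2), 3))
    print("ROUGE-L:", round(rouge_l(candidate, reference), 3))
```

In a study like this one, such scores would be computed for each generated summary against its reference and averaged per model and dataset. BERTScore is different in kind: it compares contextual embeddings rather than surface n-grams, so it needs a pretrained model and is not reproduced here.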

Experimental results

Research questions

  • RQ1: Which LLM provides the highest ROUGE and BERT scores for CNN/Daily Mail and XSum summaries?
  • RQ2: How do 7B-instruct models compare to the OpenAI text-davinci-003 in abstractive summarization?
  • RQ3: What is the influence of dataset type (CNN vs XSum) on model performance across metrics?

Key findings

  • text-davinci-003 consistently achieves high ROUGE and BERT scores across both datasets.
  • Among the 7B-instruct models, MPT-7b-instruct generally outperforms Falcon-7b-instruct.
  • CNN/Daily Mail and XSum yield different average word counts, with detailed metric variations by model and dataset.
  • ROUGE-1, ROUGE-2, ROUGE-L and BERT scores vary notably by model and dataset, favoring the OpenAI model in many cases.

This review was created by AI and reviewed by human editors.