QUICK REVIEW

[Paper Review] Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models

Lochan Basyal, Mihir M. Sanghvi|arXiv (Cornell University)|Oct 16, 2023

Topic Modeling | Cited by 11

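One-Sentence Summary

This paper compares MPT-7b-instruct, Falcon-7b-instruct, and OpenAI text-davinci-003 on text summarization over the CNN/Daily Mail and XSum datasets, scoring the outputs with BLEU, ROUGE, and BERT Score, and finds that text-davinci-003 is generally the strongest.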

ABSTRACT

Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Leveraging Large Language Models (LLMs) has shown remarkable promise in enhancing summarization techniques. This paper embarks on an exploration of text summarization with a diverse set of LLMs, including MPT-7b-instruct, falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models. The experiment was performed with different hyperparameters and evaluated the generated summaries using widely accepted metrics such as the Bilingual Evaluation Understudy (BLEU) Score, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score, and Bidirectional Encoder Representations from Transformers (BERT) Score. According to the experiment, text-davinci-003 outperformed the others. This investigation involved two distinct datasets: CNN Daily Mail and XSum. Its primary objective was to provide a comprehensive understanding of the performance of Large Language Models (LLMs) when applied to different datasets. The assessment of these models' effectiveness contributes valuable insights to researchers and practitioners within the NLP domain. This work serves as a resource for those interested in harnessing the potential of LLMs for text summarization and lays the foundation for the development of advanced Generative AI applications aimed at addressing a wide spectrum of business challenges.

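Motivation and Objectives

Evaluate how different large language models perform on abstractive summarization across two datasets (CNN/Daily Mail and XSum).
Analyze how model scale and instruction tuning (MPT-7b-instruct, Falcon-7b-instruct) affect summary quality.
Quantify summary quality with BLEU, ROUGE, and BERT Score to guide model selection for practical NLP tasks.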

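Proposed Method

Compare the LLMs (MPT-7b-instruct, Falcon-7b-instruct, text-davinci-003) under a uniform inference configuration (temperature = 0.1, maximum of 100 tokens).
Use LangChain and Hugging Face pipelines for prompt engineering and execution on a GCE VM equipped with an NVIDIA T4 GPU.
Evaluate the generated summaries with BLEU, ROUGE (N, L), and BERT Score; report the average word count for each dataset.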
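The overlap-based part of this evaluation can be illustrated with a minimal pure-Python sketch of ROUGE-N, ROUGE-L, and average word count. This is not the authors' evaluation code: the helper names (`tokenize`, `rouge_n`, `rouge_l`, `average_word_count`) are hypothetical, tokenization is plain lowercase whitespace splitting, and the F1 variant is reported as an assumption, since the summary does not say whether precision, recall, or F-measure was used. In practice an established ROUGE package would be preferable.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplification; real ROUGE
    tooling also handles punctuation and optional stemming)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision (overlap/candidate) and recall (overlap/reference)."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


def average_word_count(texts):
    """Mean number of whitespace-separated words per text."""
    return sum(len(t.split()) for t in texts) / len(texts) if texts else 0.0
```

For example, `rouge_n(summary, reference, n=2)` scores bigram overlap for one generated summary against its reference; averaging such scores over a dataset gives one figure per model and metric. BERT Score is not sketched here because it requires a pretrained embedding model.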

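Results

Research Questions

RQ1: Which LLM achieves the highest ROUGE and BERT scores on CNN/Daily Mail and XSum summarization?
RQ2: How do the 7B-instruct models compare with OpenAI text-davinci-003 on abstractive summarization?
RQ3: How does the dataset (CNN/Daily Mail vs. XSum) affect model performance across the metrics?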

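Key Findings

text-davinci-003 consistently achieves higher ROUGE and BERT scores on both datasets.
Among the 7B-instruct models, MPT-7b-instruct generally outperforms Falcon-7b-instruct.
CNN/Daily Mail and XSum yield different average word counts, and the metric values vary in detail by model and dataset.
ROUGE-1, ROUGE-2, ROUGE-L, and BERT scores vary markedly across models and datasets, in many cases favoring the OpenAI model.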
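Comparisons of this kind come down to averaging per-example scores for each model, dataset, and metric, then ranking the models. The sketch below shows one way to do that aggregation. The record layout (`model`, `dataset`, `metric`, `score` keys) and the function names are assumptions made for illustration, and no values from the paper are reproduced.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(records):
    """Average per-example scores.

    records: iterable of dicts with keys 'model', 'dataset', 'metric', 'score'
    (a hypothetical layout). Returns {(dataset, metric): {model: mean score}}.
    """
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["dataset"], r["metric"], r["model"])].append(r["score"])

    table = defaultdict(dict)
    for (dataset, metric, model), scores in grouped.items():
        table[(dataset, metric)][model] = mean(scores)
    return dict(table)


def best_model(table):
    """Pick the model with the highest mean for each (dataset, metric) pair."""
    return {key: max(models, key=models.get) for key, models in table.items()}
```

Calling `best_model(summarize_scores(records))` returns one winning model per dataset and metric, which is the form in which the findings above are stated.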

This review was generated by AI and reviewed by a human editor.