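[Paper Explained] Summarization is (Almost) Dead
This study shows that zero-shot summaries from large language models (LLMs) are often preferred over human-written summaries and summaries from fine-tuned models across multiple tasks, challenging the traditional direction of summarization research.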
How well can large language models (LLMs) generate summaries? We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of LLMs across five distinct summarization tasks. Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. Specifically, LLM-generated summaries exhibit better factual consistency and fewer instances of extrinsic hallucinations. Due to the satisfactory performance of LLMs in summarization tasks (even surpassing the benchmark of reference summaries), we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs. However, we recognize that there are still some directions worth exploring, such as the creation of novel datasets with higher quality and more reliable evaluation methods.
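Research Motivation and Goals
- Evaluate the quality of zero-shot LLM summaries across five tasks (single-news, multi-news, dialogue, code, and cross-lingual summarization).
- Compare LLM-generated summaries with human-written reference summaries and with summaries from fine-tuned models through human evaluation.
- Examine factual consistency and hallucinated content across the different summarization systems.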
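Proposed Method
- Build new evaluation datasets for the five summarization tasks, ensuring the data postdates the cutoff date to avoid training-data leakage.
- For each task, evaluate GPT-3 (text-davinci-003), GPT-3.5, GPT-4, and one or two fine-tuned baseline models using pairwise human judgments.
- Measure pairwise win rates and compute Cohen's kappa to assess inter-annotator agreement.
- Analyze hallucinations at the sentence level and classify them as intrinsic or extrinsic.
- Provide qualitative case studies and an appendix with task-specific analyses.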
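The two quantities named in the method, pairwise win rate and Cohen's kappa, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the record layout (`a`, `b`, `winner`) and the function names are assumptions made for the example, and the sample data is invented.

```python
from collections import Counter


def win_rate(preferences, system):
    """Fraction of pairwise comparisons involving `system` that it won.

    `preferences` is a list of dicts with keys "a", "b" (the two systems
    compared) and "winner" (the preferred one). Hypothetical layout.
    """
    involved = [p for p in preferences if system in (p["a"], p["b"])]
    if not involved:
        return 0.0
    wins = sum(1 for p in involved if p["winner"] == system)
    return wins / len(involved)


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_1)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example data, for illustration only.
prefs = [
    {"a": "gpt4", "b": "human", "winner": "gpt4"},
    {"a": "gpt4", "b": "human", "winner": "gpt4"},
    {"a": "gpt4", "b": "human", "winner": "human"},
    {"a": "gpt4", "b": "finetuned", "winner": "gpt4"},
]
annotator_1 = ["gpt4", "gpt4", "human", "gpt4"]
annotator_2 = ["gpt4", "human", "human", "gpt4"]

print(win_rate(prefs, "gpt4"))
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a more conservative reliability signal than the simple share of matching judgments.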
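Experimental Results
Research Questions
- RQ1: Are LLM-generated summaries preferred by human evaluators over human-written summaries and fine-tuned-model summaries across the five tasks?
- RQ2: Compared with human-written or fine-tuned-model summaries, are LLM summaries more factually consistent and less prone to extrinsic hallucinations?
- RQ3: What are the limitations of LLM-based summarization, and what should future research focus on?
- RQ4: How should summarization datasets and evaluation methods evolve in the era of LLMs?
- RQ5: How do LLMs differ from traditional fine-tuned models in topic coverage and length flexibility?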
Key Findings
| System | Single-news | Multi-news | Cross-lingual | Dialogue | Code |
|---|---|---|---|---|---|
| GPT-4 | 8 | 5 | 16 | 5 | 9 |
| Human | 13 | 62 | 15 | 15 | 46 |
- LLM-generated summaries are consistently preferred by human evaluators over human-written and fine-tuned-model summaries across all five tasks.
- GPT-4 and other LLMs show lower rates of sentence-level hallucinations compared with several human-written references in some tasks, though extrinsic hallucinations are prominent in contexts with poor factual consistency.
- Extrinsic hallucinations largely explain the poorer factual consistency of some human-written references, especially in multi-news and code summarization.
- Fine-tuned models tend to produce fixed-length outputs and may miss topics when inputs cover multiple subjects, whereas LLMs adapt length and achieve broader topic coverage.
- A large-scale survey of recent ACL/EMNLP/COLING/NAACL papers suggests that roughly 70% of traditional summarization research may be less meaningful in the LLM era.
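The sentence-level hallucination analysis behind these findings amounts to tallying annotated sentences by type. The sketch below assumes each summary sentence has already been labelled by an annotator as `"none"`, `"intrinsic"` (contradicts the source), or `"extrinsic"` (adds content that cannot be verified from the source); the label strings, the function name, and the sample labels are hypothetical and not taken from the paper.

```python
from collections import Counter

VALID_LABELS = {"none", "intrinsic", "extrinsic"}


def hallucination_breakdown(sentence_labels):
    """Count hallucinated sentences and the share that are extrinsic."""
    unknown = set(sentence_labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    # Only sentences flagged as hallucinated enter the tally.
    counts = Counter(label for label in sentence_labels if label != "none")
    total = counts["intrinsic"] + counts["extrinsic"]
    return {
        "total": total,
        "intrinsic": counts["intrinsic"],
        "extrinsic": counts["extrinsic"],
        # Share of hallucinated sentences that are extrinsic (0.0 if none).
        "extrinsic_share": counts["extrinsic"] / total if total else 0.0,
    }


# Invented labels for one system on one task, for illustration only.
labels = ["none", "extrinsic", "intrinsic", "extrinsic", "none"]
print(hallucination_breakdown(labels))
```

Running the same tally per system and per task would yield both a hallucination count and an extrinsic share, the two quantities the findings above compare between LLM and human-written summaries.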
This explainer was generated by AI and reviewed by a human editor.