QUICK REVIEW

[Paper Review] BooookScore: A systematic exploration of book-length summarization in the era of LLMs

Yapei Chang, Kyle Lo|arXiv (Cornell University)|Oct 1, 2023

Topic Modeling | Cited by 14

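ONE-SENTENCE SUMMARY

This paper studies the coherence of book-length summaries generated by large language models (LLMs) and introduces BooookScore, an automatic metric validated against fine-grained human annotations, which is used to compare prompting strategies, base models, and chunk sizes for long documents.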

ABSTRACT

Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K USD and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than those generated by open-source models. While LLaMA 2 falls behind other models, Mixtral achieves performance on par with GPT-3.5-Turbo. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by annotators.
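The two prompting workflows named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `summarize` is a hypothetical stand-in for an LLM call, `chunk_text`, `hierarchical_merge`, `incremental_update`, and `group_size` are names invented here, and the compression step the abstract mentions (shrinking a summary that grows too long) is omitted.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes prompt text, returns a summary.
Summarizer = Callable[[str], str]


def chunk_text(tokens: List[str], chunk_size: int) -> List[str]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def hierarchical_merge(chunks: List[str], summarize: Summarizer,
                       group_size: int = 2) -> str:
    """Summarize every chunk, then merge groups of summaries level by level."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not chunks:
        return ""
    level = [summarize(chunk) for chunk in chunks]
    while len(level) > 1:
        # Each pass merges neighbouring summaries into a shorter list.
        level = [
            summarize("\n".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
    return level[0]


def incremental_update(chunks: List[str], summarize: Summarizer) -> str:
    """Keep one running summary and update it with each chunk in order."""
    running = ""
    for chunk in chunks:
        prompt = chunk if not running else running + "\n" + chunk
        running = summarize(prompt)
    return running
```

In this sketch, hierarchical merging calls the summarizer once per chunk plus once per merge, whereas incremental updating calls it exactly once per chunk while carrying the running summary forward.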

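MOTIVATION AND OBJECTIVES

Evaluate the coherence errors in LLM-based book-length summaries produced by chunking followed by either hierarchical merging or incremental updating.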

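PROPOSED METHOD

Define a new human coherence-evaluation protocol that uses newly published books to avoid data contamination.
Collect 1,193 span-level human annotations on GPT-4-generated summaries of 100 books, covering two prompting strategies (hierarchical merging and incremental updating).
Develop BooookScore, an automatic sentence-level coherence metric that uses few-shot prompting to detect eight types of coherence errors.
Validate BooookScore against human annotations to establish its precision and reliability (roughly 78-80% precision).
Use BooookScore and a cost analysis to systematically evaluate different LLMs, chunk sizes, and prompting strategies.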
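As the abstract describes it, BooookScore is the proportion of sentences in a summary that contain none of the identified error types. A minimal sketch of that ratio is given below, assuming the summary is already split into sentences; `annotate` is a hypothetical callable standing in for the few-shot LLM annotator, and the function name `booookscore` is chosen here for illustration only.

```python
from typing import Callable, List

# Hypothetical annotator: given one sentence and the full summary, return the
# list of coherence-error labels found in that sentence (empty if none).
Annotator = Callable[[str, str], List[str]]


def booookscore(sentences: List[str], annotate: Annotator) -> float:
    """Return the fraction of sentences for which no error type was flagged."""
    if not sentences:
        raise ValueError("summary must contain at least one sentence")
    summary = " ".join(sentences)
    clean = sum(1 for sentence in sentences if not annotate(sentence, summary))
    return clean / len(sentences)
```

A usage example with a toy annotator that flags a single sentence:

```python
def toy_annotator(sentence: str, summary: str) -> List[str]:
    # Pretend the second sentence omits a cause the reader needs.
    return ["causal omission"] if sentence == "B." else []

score = booookscore(["A.", "B.", "C.", "D."], toy_annotator)
```

Because the score is a per-sentence ratio, a longer summary is not penalized merely for having more sentences, only for the share of them that are flagged.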

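EXPERIMENTAL RESULTS

Research Questions

RQ1: What coherence errors do modern LLMs make when summarizing book-length documents through hierarchical merging or incremental updating?
RQ2: Can the automatic metric BooookScore reliably detect these coherence errors without gold-standard references?
RQ3: How do the prompting strategy, base LLM, and chunk size affect the coherence and level of detail of book-length summaries?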

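Key Findings

Eight types of coherence errors appear in book-length summaries, including causal omission and salience errors; omission errors are the most common.
Hierarchical merging produces summaries that are more coherent but less detailed than those from incremental updating.
Under the tested settings, GPT-4 and Claude 2 produce more coherent summaries than LLaMA 2 or ChatGPT.
For Claude 2, incremental updating benefits from a larger chunk size, whereas hierarchical merging does not see the same benefit; LLaMA 2 performs poorly overall.
BooookScore agrees closely with human judgments (about 78.2% precision vs. 79.7% for human annotators) and enables a cost-saving analysis (about $15K USD and 500 hours saved in human evaluation).
Closed-source models (GPT-4, Claude 2) outperform open-source models on coherence, although longer outputs can bring both more detail and higher cost.
Human preferences do not correlate perfectly with BooookScore, which points to a trade-off between coherence and level of detail.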

This review was generated by AI and checked by a human editor.