QUICK REVIEW

[Paper Review] BooookScore: A systematic exploration of book-length summarization in the era of LLMs

Yapei Chang, Kyle Lo|arXiv (Cornell University)|2023. 10. 01.

Topic Modeling | Citations: 14

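One-line summary

This paper studies the coherence of LLM-generated book-length summaries and introduces BooookScore, an automatic metric validated against fine-grained human annotations, which it uses to compare prompting strategies, base models, and chunk sizes for long documents.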

ABSTRACT

Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K USD and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than those generated by open-source models. While LLaMA 2 falls behind other models, Mixtral achieves performance on par with GPT-3.5-Turbo. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by annotators.

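Research motivation and goals

Evaluate the coherence errors in LLM-based book-length summaries that are produced by chunking the input and then either hierarchically merging chunk-level summaries or incrementally updating a running summary.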
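The two workflows named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `summarize` is a hypothetical stand-in for an LLM call, and `chunk_tokens`, `hierarchical_merge`, `incremental_update`, and the `fan_in` parameter are names assumed here for illustration only.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes text, returns a shorter summary.
Summarizer = Callable[[str], str]


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[str]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def hierarchical_merge(chunks: List[str], summarize: Summarizer, fan_in: int = 2) -> str:
    """Summarize every chunk, then merge groups of summaries until one remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = [summarize(chunk) for chunk in chunks]
    while len(level) > 1:
        # Each pass merges up to fan_in neighbouring summaries into one.
        level = [
            summarize("\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def incremental_update(chunks: List[str], summarize: Summarizer) -> str:
    """Walk through the chunks in order, folding each into a running summary."""
    running = ""
    for chunk in chunks:
        running = summarize(f"{running}\n{chunk}" if running else chunk)
    return running
```

With three chunks and `fan_in=2`, the hierarchical version makes six summarizer calls (three chunk summaries, then two merges, then one final merge), while the incremental version makes three, one per chunk. In the incremental version every later summary depends on all earlier ones.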

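Proposed method

Defines a new human coherence evaluation protocol that uses recently published books to avoid data contamination.
Collects 1193 span-level human annotations on GPT-4 summaries of 100 books, generated with two prompting strategies (hierarchical merging and incremental updating).
Develops BooookScore, an automatic sentence-level coherence metric that uses few-shot prompting to detect eight types of coherence errors.
Validates BooookScore against human annotations to establish its precision and reliability (roughly 78-80% precision).
Uses BooookScore, together with a cost analysis, to systematically evaluate different LLMs, chunk sizes, and prompting strategies.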
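The abstract defines BooookScore as the proportion of sentences in a summary that contain none of the identified error types. A minimal sketch of that final aggregation step is shown below. It assumes the per-sentence error labels have already been produced (by an LLM annotator or a human); the function name `booookscore` and the input layout are assumptions for illustration, not the authors' code.

```python
from typing import Sequence


def booookscore(sentence_errors: Sequence[Sequence[str]]) -> float:
    """Return the fraction of sentences that have no flagged coherence error.

    sentence_errors[i] holds the error-type labels assigned to sentence i;
    an empty sequence means the sentence was judged error-free.
    """
    if not sentence_errors:
        raise ValueError("summary must contain at least one sentence")
    clean = sum(1 for errors in sentence_errors if len(errors) == 0)
    return clean / len(sentence_errors)


# Four sentences, two of which carry at least one error label.
example = [[], ["causal omission"], [], ["salience", "causal omission"]]
score = booookscore(example)  # 2 clean sentences out of 4
```

A sentence with several error labels counts once as a flawed sentence, so the score reflects how many sentences are affected, not how many errors were found.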

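Experimental results

Research questions

RQ1: What coherence errors do modern LLMs make when summarizing book-length documents via chunking followed by merging or incremental updating?
RQ2: Can the automatic metric BooookScore reliably detect these coherence errors without gold references?
RQ3: How do prompting strategy, base LLM, and chunk size affect the coherence and level of detail of book-length summaries?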

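Key findings

Eight types of coherence errors appear in book-length summaries, including causal omission and salience errors; omission errors are the most common.
Hierarchical merging yields summaries that are more coherent but less detailed than those from incremental updating.
GPT-4 and Claude 2 produce more coherent summaries than LLaMA 2 or ChatGPT in the tested settings.
For Claude 2, larger chunk sizes benefit incremental updating but not hierarchical merging; LLaMA 2 underperforms overall.
BooookScore agrees closely with human judgment (about 78.2% precision versus 79.7% for human annotators) and enables cost-saving analysis (saving $15K USD and 500 hours of human evaluation).
Closed-source models (GPT-4, Claude 2) outperform open-source models on coherence, though longer outputs can bring both more detail and higher cost.
Human preferences do not correlate perfectly with BooookScore, which suggests a trade-off between coherence and level of detail.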
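The precision figures quoted above can be read as the share of flagged annotations that a validator confirms as correct. The sketch below computes that share under this assumption; the function name `annotation_precision` and the boolean input format are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def annotation_precision(validated: Sequence[bool]) -> float:
    """Fraction of annotations that a validator marked as correct.

    validated[i] is True when annotation i was confirmed, False otherwise.
    """
    if not validated:
        raise ValueError("need at least one annotation to compute precision")
    return sum(validated) / len(validated)


# Three of four annotations were confirmed by the validator.
precision = annotation_precision([True, True, True, False])
```

Applying the same function to human-written and automatically generated annotations gives two directly comparable numbers, which is how a comparison such as 78.2% versus 79.7% could be set up.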

This review was generated by AI and reviewed by a human editor.