QUICK REVIEW

[Paper Review] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv|arXiv (Cornell University)|Aug 28, 2023

Topic Modeling | Cited by 8

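One-Sentence Summary

LongBench is the first bilingual, multitask benchmark for long context understanding. It covers 21 tasks in English and Chinese, with about 4,750 test instances, and evaluates how large language models perform on long documents using automatic metrics such as ROUGE-L and F1. The paper analyzes model performance, the effect of context length, and retrieval-based and summarization-based context compression.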

ABSTRACT

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

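Motivation and Objectives

Define a comprehensive bilingual benchmark for long context understanding that spans multiple tasks and domains.
Standardize the data into a unified format for automatic evaluation.
Evaluate how current LLMs perform on long documents and how context length affects that performance.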

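Proposed Method

Cover 21 tasks in six categories, in English and Chinese: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
Standardize the datasets into a unified evaluation format and score them with automatic metrics (ROUGE-L, F1, EM, and classification accuracy).
Build LongBench-E, which has a more uniform length distribution, to study performance at different context lengths.
Evaluate eight LLMs, including GPT-3.5-Turbo-16k and several open-source models, in zero-shot and few-shot settings.
Study retrieval-based and summarization-based context compression and how each affects the individual models.
Compare results with and without the context provided, to separate memorization from genuine long context understanding.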
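
The automatic metrics named above can be illustrated with a minimal Python sketch of token-level F1 and an LCS-based ROUGE-L F-measure. This is not the benchmark's official scoring code: the function names are hypothetical, and the whitespace tokenization is an assumption that only suits English (Chinese text would need character-level or segmenter-based tokenization).

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each shared token counts min(pred count, ref count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure: precision and recall computed from the LCS length."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 ignores word order, which suits short QA answers, while ROUGE-L rewards in-order overlap, which suits summaries.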

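Experimental Results

Research Questions

RQ1: How do current LLMs perform on long context tasks across languages and domains?
RQ2: How does increasing context length affect model performance on LongBench and LongBench-E?
RQ3: Do retrieval-based or summarization-based context compression methods consistently improve long context understanding, and for which models?
RQ4: In long-document tasks, how much do models rely on memorization rather than genuine long context understanding?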
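
A length-effect analysis of the kind RQ2 asks for can be sketched by grouping per-example scores into length buckets and averaging within each bucket. The sketch below is illustrative only: the record keys `length` and `score`, the function names, and the default boundaries of 4,000 and 8,000 are assumptions, not values confirmed by this review.

```python
from collections import defaultdict
from statistics import mean


def bucket_label(length: int, boundaries: tuple = (4000, 8000)) -> str:
    """Return a label such as '0-4000' or '8000+' for a given input length."""
    lower = 0
    for upper in boundaries:
        if length < upper:
            return f"{lower}-{upper}"
        lower = upper
    return f"{lower}+"


def mean_score_by_length(records: list, boundaries: tuple = (4000, 8000)) -> dict:
    """Average the 'score' of each record within its length bucket."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_label(rec["length"], boundaries)].append(rec["score"])
    return {label: mean(scores) for label, scores in buckets.items()}
```

Comparing bucket means is only informative when each bucket holds a similar number of examples, which is the reason for building a length-balanced set such as LongBench-E.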

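Key Findings

The commercial GPT-3.5-Turbo-16k generally outperforms the open-source models but still struggles on very long contexts.
Scaled position embeddings and fine-tuning on longer sequences bring substantial gains in long context understanding for some models.
Retrieval-based context compression helps models with weaker long context ability but does not fully close the gap to models with stronger long context ability.
Summarization-based compression helps on some long and very long tasks, but its overall gain across the benchmark is limited.
LongBench-E shows that even models trained or fine-tuned on long contexts degrade noticeably as context length grows, which underlines how hard long context understanding remains.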
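
Retrieval-based context compression, as discussed in the findings above, can be sketched as: split the long context into chunks, score each chunk against the query, and keep only the best-scoring chunks. In the sketch below the word-overlap scorer is a simple stand-in for a real retriever, and the function names, the chunk size, and `top_k` are all hypothetical choices made for illustration.

```python
def split_into_chunks(text: str, chunk_words: int) -> list:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def overlap_score(query: str, chunk: str) -> int:
    """Count chunk words that also appear in the query (stand-in for a retriever)."""
    query_terms = set(query.lower().split())
    return sum(1 for word in chunk.lower().split() if word in query_terms)


def compress_context(query: str, context: str, chunk_words: int = 200, top_k: int = 3) -> str:
    """Keep the top_k highest-scoring chunks, restored to document order."""
    chunks = split_into_chunks(context, chunk_words)
    # Stable sort: chunks with equal scores keep their original relative order.
    ranked = sorted(range(len(chunks)), key=lambda i: overlap_score(query, chunks[i]), reverse=True)
    kept = sorted(ranked[:top_k])
    return "\n".join(chunks[i] for i in kept)
```

Restoring document order after selection is a design choice of this sketch; ordering the kept chunks by retrieval score is an equally plausible alternative.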

This review was generated by AI and reviewed by a human editor.