QUICK REVIEW

[Paper Review] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv | arXiv (Cornell University) | Aug 28, 2023

Topic Modeling | Citations: 8

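One-Line Summary

LongBench is the first bilingual, multitask benchmark for long context understanding. It covers 21 tasks in English and Chinese with about 4,750 test instances, evaluates LLMs on long documents, and scores them automatically with metrics such as ROUGE-L and F1. The paper analyzes model performance, the effect of context length, and retrieval-based and summarization-based context compression.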

ABSTRACT

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

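Motivation and Objectives

Define a comprehensive bilingual benchmark for long context understanding that spans multiple tasks and domains.
Standardize the data into a unified format so that evaluation can be automated.
Evaluate how current LLMs perform on long documents and how context length affects that performance.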

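Proposed Method

Assemble 21 tasks in English and Chinese, grouped into six categories: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
Standardize the datasets into a unified evaluation format scored with automatic metrics (ROUGE-L, F1, EM, classification accuracy).
Build LongBench-E, a variant with a more uniform distribution of context lengths, to study performance across different context lengths.
Evaluate eight long-context LLMs, including GPT-3.5-Turbo-16k and open models, in zero-shot and few-shot settings.
Investigate retrieval-based and summarization-based context compression and assess how the effect differs across models.
Compare performance with and without the provided context to separate memorization from genuine long context understanding.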
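
The automatic metrics named above can be illustrated with a minimal sketch. This is a generic token-level F1 and an LCS-based ROUGE-L F-measure, not the benchmark's own scoring code. The function names are hypothetical, and the lowercased whitespace tokenization is an assumption that would not suit Chinese text, which needs a word or character segmenter.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between lowercased, whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts the tokens shared by both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure: precision and recall computed from the LCS length."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because both functions take only a prediction string and a reference string, a unified record format with one answer field is enough to score every task with the same loop.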

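Experimental Results

Research Questions

RQ1: How well do current LLMs perform on long-context tasks that span multiple languages and domains?
RQ2: How does increasing the context length affect model performance on LongBench and LongBench-E?
RQ3: Does retrieval-based or summarization-based context compression consistently improve long context understanding, and for which models does it help?
RQ4: On long-document tasks, do models rely more on memorization or on genuine long context understanding, and to what extent?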
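
One way to examine a question like RQ2 is to group per-example scores by context length and compare the bucket averages. The sketch below assumes hypothetical bucket boundaries (0-4k, 4k-8k, 8k+ words) and a hypothetical input of `(context_length, score)` pairs; neither is specified in this review.

```python
from collections import defaultdict

# Hypothetical bucket boundaries in words (assumption, not from the review).
BUCKETS = [(0, 4000, "0-4k"), (4000, 8000, "4k-8k"), (8000, None, "8k+")]


def bucket_label(length: int) -> str:
    """Return the label of the bucket that contains the given length."""
    for low, high, label in BUCKETS:
        if length >= low and (high is None or length < high):
            return label
    raise ValueError("length must be non-negative")


def mean_score_by_bucket(records) -> dict:
    """Average the scores of (context_length, score) pairs within each bucket."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for length, score in records:
        label = bucket_label(length)
        totals[label] += score
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}
```

A falling average from the shortest bucket to the longest would indicate that the model degrades as contexts grow. A test set with roughly equal counts per bucket keeps the averages comparable.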

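Key Findings

The commercial GPT-3.5-Turbo-16k generally outperforms the open models but still struggles on very long contexts.
Scaling position embeddings to longer sequences and fine-tuning on longer sequences bring notable gains in long context understanding for some models.
Retrieval-based context compression helps models that are weak on long contexts, but it does not fully close the gap to models with strong long-context ability.
Summarization-based compression is of some use on long or very long tasks, but its benefit across the benchmark as a whole is limited.
LongBench-E shows that performance drops sharply as context length grows, even for models trained or fine-tuned on long contexts, which underlines that long context understanding remains a real challenge.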
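
Retrieval-based context compression can be sketched as follows: split the long context into chunks, score each chunk against the query, and keep only the best few in their original order. The word-overlap scorer here is a deliberately simple stand-in for a real retriever, and the function names, `chunk_size`, and `top_k` are illustrative assumptions rather than the paper's settings.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 200) -> list:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> int:
    """Count the distinct lowercased words shared by the query and the chunk."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(query_words & chunk_words)


def compress_context(query: str, context: str,
                     chunk_size: int = 200, top_k: int = 3) -> str:
    """Keep the top_k highest-scoring chunks, restored to document order."""
    chunks = split_into_chunks(context, chunk_size)
    # Rank chunk indices by score, highest first; the sort is stable,
    # so ties favor chunks that appear earlier in the document.
    ranked = sorted(range(len(chunks)),
                    key=lambda i: overlap_score(query, chunks[i]),
                    reverse=True)
    keep = sorted(ranked[:top_k])
    return "\n".join(chunks[i] for i in keep)
```

The compressed string replaces the full context in the prompt, so a model with a short window sees only the passages most likely to contain the answer. Anything the retriever misses is lost, which is consistent with the finding that compression does not fully close the gap to models that read the whole context well.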

This review was generated by AI and checked by a human editor.