QUICK REVIEW

[Paper Review] NoLiMa: Long-Context Evaluation Beyond Literal Matching

Ali Modarressi, Hanieh Deilamsalehy|ArXiv.org|Feb 7, 2025

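Speech and dialogue systems | Cited by 4

One-Line Summary

NoLiMa is a benchmark that minimizes literal overlap between questions and needles; it shows that the latent associative reasoning of 12 popular long-context LLMs degrades as context length grows.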

ABSTRACT

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.

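Motivation and Objectives

Motivate an evaluation of long-context understanding that goes beyond surface-level string matching.
Design a needle set that minimizes lexical overlap between questions and needles, so that it tests latent associative reasoning.
Evaluate how state-of-the-art LLMs perform as they scale from short contexts to very long ones (up to 32K tokens).
Analyze factors that affect length generalization, including latent hops, needle placement, and the removal of literal matches.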

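Proposed Method

Build the NoLiMa needle set, in which each question is tied to its needle only through an associative link (one-hop or two-hop).
Embed the needles in long haystacks built from openly licensed books, after removing distracting or conflicting information from the haystacks.
Evaluate 12 models (each supporting at least 128K tokens) at multiple context lengths, using 58 question-needle pairs and 5 haystacks per length.
Normalize long-context performance against a base score obtained from short contexts (1K tokens or fewer).
Analyze how the number of latent hops, inversion of the question structure, and Chain-of-Thought prompting affect performance.
Run ablations (the Direct and MC setups) showing how literal matches change task difficulty.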
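
The two core ideas in the steps above, checking that a question and needle share almost no words and placing the needle at a chosen depth in the haystack, can be sketched as follows. This is a minimal illustration, not the benchmark's released code: the function names, the tiny stopword list, the Jaccard overlap measure, and the example strings are all assumptions made for this sketch.

```python
import re

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "is", "has", "been",
             "which", "who", "next", "actually"}

def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def lexical_overlap(question, needle):
    """Jaccard overlap between the content words of a question and a needle."""
    q, n = content_words(question), content_words(needle)
    if not q and not n:
        return 0.0
    return len(q & n) / len(q | n)

def insert_needle(haystack_sentences, needle, depth):
    """Return a new haystack with the needle placed at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = round(depth * len(haystack_sentences))
    return haystack_sentences[:idx] + [needle] + haystack_sentences[idx:]

# Illustrative strings: the literal needle repeats the question's key word,
# while the latent needle requires knowing which city the landmark is in.
QUESTION = "Which character has been to Dresden?"
LITERAL_NEEDLE = "Yuki has been to Dresden."
LATENT_NEEDLE = "Actually, Yuki lives next to the Semper Opera House."

if __name__ == "__main__":
    print(lexical_overlap(QUESTION, LITERAL_NEEDLE))  # greater than zero
    print(lexical_overlap(QUESTION, LATENT_NEEDLE))   # zero shared content words
    print(insert_needle(["s1", "s2", "s3", "s4"], LATENT_NEEDLE, 0.5))
```

A needle set in the spirit of NoLiMa would keep only pairs like the latent one, where the overlap is zero and the link has to be inferred from world knowledge.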

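Experimental Results

Research Questions

RQ1: With literal overlap minimized, how much does model performance on long-context retrieval degrade as context length increases?
RQ2: How do the number of latent reasoning steps (one-hop vs. two-hop) and the needle's position in the context affect accuracy on NoLiMa?
RQ3: To what extent can Chain-of-Thought prompting or reasoning-based models mitigate the length-generalization gap?
RQ4: How do literal matches and distractors affect model success on long-context associative tasks?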

Key Findings

Model	Claimed Length	Effective Length	Base Score (85% threshold)	1K	2K	4K	8K	16K	32K
GPT-4o	128K	8K	99.3 (84.4)	98.1	98.0	95.7	89.2	81.6	69.7
Llama 3.3 70B	128K	2K	97.3 (82.7)	94.2	87.4	81.5	72.1	59.5	42.7
Llama 3.1 405B	128K	2K	94.7 (80.5)	89.0	85.0	74.5	60.1	48.4	38.0
Llama 3.1 70B	128K	2K	94.5 (80.3)	91.0	81.8	71.2	62.7	51.8	43.2
Gemini 1.5 Pro	2M	2K	92.6 (78.7)	86.4	82.7	75.4	63.9	55.5	48.2
Jamba 1.5 Mini	256K	<1K	92.4 (78.6)	76.3	74.1	70.8	62.2	52.7	43.6
Command R+	128K	<1K	90.9 (77.3)	77.0	73.5	66.3	39.5	21.3	7.4
Mistral Large 2	128K	2K	87.9 (74.7)	86.1	85.5	73.3	51.5	32.6	18.7
Claude 3.5 Sonnet	200K	4K	87.6 (74.4)	85.4	84.0	77.6	61.7	45.7	29.8
Gemini 1.5 Flash	1M	<1K	84.7 (72.0)	68.6	61.6	51.0	44.4	35.5	28.6
GPT-4o mini	128K	<1K	84.9 (72.2)	67.7	58.2	44.1	32.6	20.6	13.7
Llama 3.1 8B	128K	1K	76.7 (65.2)	65.7	54.4	44.1	31.9	22.6	14.2
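
The Effective Length column can be reproduced from the other columns. The parenthesized value next to each base score is about 85% of it (for GPT-4o, 0.85 × 99.3 ≈ 84.4), and the effective length is the longest tested context whose score stays at or above that value. The sketch below assumes this reading of the table; `effective_length`, `LENGTHS`, and `ROWS` are names introduced here, and only four rows are copied for brevity.

```python
LENGTHS = ["1K", "2K", "4K", "8K", "16K", "32K"]

# model -> (threshold from the parentheses, scores at 1K..32K), copied from the table
ROWS = {
    "GPT-4o": (84.4, [98.1, 98.0, 95.7, 89.2, 81.6, 69.7]),
    "Llama 3.3 70B": (82.7, [94.2, 87.4, 81.5, 72.1, 59.5, 42.7]),
    "Claude 3.5 Sonnet": (74.4, [85.4, 84.0, 77.6, 61.7, 45.7, 29.8]),
    "Command R+": (77.3, [77.0, 73.5, 66.3, 39.5, 21.3, 7.4]),
}

def effective_length(threshold, scores):
    """Longest tested length whose score is at or above the threshold.

    Assumption: scanning stops at the first length that falls below the
    threshold, which matches the table because every row decreases steadily.
    """
    best = "<1K"
    for length, score in zip(LENGTHS, scores):
        if score < threshold:
            break
        best = length
    return best

if __name__ == "__main__":
    for model, (threshold, scores) in ROWS.items():
        print(f"{model}: {effective_length(threshold, scores)}")
```

For the four rows above this yields 8K, 2K, 4K, and <1K, which agrees with the Effective Length column.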

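Most models achieve high base scores on short contexts, but their effective length (the longest context at which the score stays at or above 85% of the base score) is often 2K tokens or less; GPT-4o, at 8K, is the notable exception.
Performance drops sharply at 32K: 10 of the 12 models fall below half of their base score.
Two-hop tasks are harder than one-hop tasks, and the gap widens as the context grows.
Inverted needle templates (where the question structure delays the direct cue) are harder than the default templates.
CoT prompting improves performance but does not fully close the long-context gap, especially on two-hop tasks beyond 16K tokens.
Literal matches make the task dramatically easier (Direct and MC setups), which underscores how heavily many benchmarks rely on surface cues.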
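
The count of models that fall below half of their base score at 32K can be checked directly against the results table. The sketch below copies the Base Score and 32K columns and computes the fraction of the base score retained at 32K; `RESULTS`, `retention`, and `below_half` are names introduced for this illustration.

```python
# model -> (base score, score at 32K), copied from the results table
RESULTS = {
    "GPT-4o": (99.3, 69.7),
    "Llama 3.3 70B": (97.3, 42.7),
    "Llama 3.1 405B": (94.7, 38.0),
    "Llama 3.1 70B": (94.5, 43.2),
    "Gemini 1.5 Pro": (92.6, 48.2),
    "Jamba 1.5 Mini": (92.4, 43.6),
    "Command R+": (90.9, 7.4),
    "Mistral Large 2": (87.9, 18.7),
    "Claude 3.5 Sonnet": (87.6, 29.8),
    "Gemini 1.5 Flash": (84.7, 28.6),
    "GPT-4o mini": (84.9, 13.7),
    "Llama 3.1 8B": (76.7, 14.2),
}

def retention(base, score):
    """Fraction of the short-context base score that is retained."""
    return score / base

# Models that keep less than half of their base score at 32K.
below_half = sorted(
    model for model, (base, s32) in RESULTS.items() if retention(base, s32) < 0.5
)

if __name__ == "__main__":
    print(len(below_half), "of", len(RESULTS), "models fall below 50% of base at 32K")
    for model, (base, s32) in RESULTS.items():
        print(f"{model}: {retention(base, s32):.1%} of base retained")
```

On these numbers, only GPT-4o and Gemini 1.5 Pro retain more than half of their base score at 32K.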

This review was generated by AI and checked by a human editor.