QUICK REVIEW

[Paper Review] Is More Context Always Better? Examining LLM Reasoning Capability for Time Interval Prediction

Yanan Cao, Farnaz Fallahi|arXiv (Cornell University)|Jan 15, 2026

Topic Modeling | Citations: 0

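One-Sentence Summary

This paper compares zero-shot LLMs with statistical and ML baselines on inter-purchase interval prediction, showing that ML models outperform LLMs and that moderate context improves LLM performance while excessive context degrades it.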

ABSTRACT

Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning and prediction across different domains. Yet, their ability to infer temporal regularities from structured behavioral data remains underexplored. This paper presents a systematic study investigating whether LLMs can predict time intervals between recurring user actions, such as repeated purchases, and how different levels of contextual information shape their predictive behavior. Using a simple but representative repurchase scenario, we benchmark state-of-the-art LLMs in zero-shot settings against both statistical and machine-learning models. Two key findings emerge. First, while LLMs surpass lightweight statistical baselines, they consistently underperform dedicated machine-learning models, showing their limited ability to capture quantitative temporal structure. Second, although moderate context can improve LLM accuracy, adding further user-level detail degrades performance. These results challenge the assumption that "more context leads to better reasoning". Our study highlights fundamental limitations of today's LLMs in structured temporal inference and offers guidance for designing future context-aware hybrid models that integrate statistical precision with linguistic flexibility.

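Motivation and Objectives

Motivate and define the problem of predicting time intervals between recurring user actions in web behavior.
Systematically compare zero-shot LLMs with statistical and machine-learning baselines on inter-purchase interval prediction.
Assess how the level of contextual information affects LLM reasoning on temporal tasks.
Highlight design implications for hybrid models that combine statistical precision with linguistic flexibility.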

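Proposed Method

Evaluate state-of-the-art LLMs (GPT-4o, Gemini-2.5, Claude-3.5) in a zero-shot setting at three prompting levels (zero, medium, and high context).
Train and benchmark conventional ML models (RandomForest, XGBoost, MLP) on structured features with a quantile loss.
Include lightweight statistical estimators (mean, median, EMA) as reference baselines.
Use two preprocessed real-world datasets (a proprietary grocery dataset and Instacart), excluding cases with fewer than 5 purchases and capping intervals at 20 days.
Evaluate performance with regression metrics (RMSE, MAE, MAPE) and the business-oriented TA@k metrics (TA@0, TA@1, TA@2).
Report deployment-related metrics (latency and cost per prediction).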
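The statistical baselines and evaluation metrics listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code. The review does not define TA@k, so the sketch assumes it is the percentage of predictions that fall within k days of the true interval. The EMA smoothing factor `alpha=0.5` and all function names are also assumptions made for the example.

```python
import math
from statistics import mean, median


def ema(intervals, alpha=0.5):
    """Exponential moving average of past intervals (alpha is an assumed value)."""
    estimate = float(intervals[0])
    for value in intervals[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate


def rmse(y_true, y_pred):
    """Root mean squared error, in days."""
    errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(errors) / len(errors))


def mae(y_true, y_pred):
    """Mean absolute error, in days."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes every true interval is nonzero."""
    errors = [abs(t - p) / t for t, p in zip(y_true, y_pred)]
    return 100 * sum(errors) / len(errors)


def ta_at_k(y_true, y_pred, k):
    """ASSUMED definition: percent of predictions within k days of the truth.

    Predictions are expected to be whole days already; no rounding is applied.
    """
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= k)
    return 100 * hits / len(y_true)


# Interval history from the Figure 1 example (5 -> 9 -> 7 -> 6 days).
history = [5, 9, 7, 6]
baselines = {
    "mean": mean(history),
    "median": median(history),
    "ema": ema(history),
}
```

Each baseline maps one product's interval history to a single next-interval estimate. The metric functions then take parallel lists of true and predicted intervals collected over many such histories.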

Figure 1. Illustration of the interval-prediction task and the three prompting conditions. The example is about repeated milk purchases with varying intervals (5 days → 9 days → 7 days → 6 days). The model observes historical intervals for a product category and predicts the next interval under zero

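Experimental Results

Research Questions

RQ1: Can LLMs outperform conventional machine-learning models at inter-purchase interval prediction?
RQ2: Does providing richer contextual information improve LLM performance on time-interval reasoning tasks?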

Key Findings

Model	TA@0	TA@1	TA@2	RMSE	MAE	MAPE
Proprietary data - GPT-4o-Z	5.75	12.72	18.65	23.66	15.50	73.03
Proprietary data - GPT-4o-M	6.13	13.83	19.92	22.95	14.76	66.78
Proprietary data - GPT-4o-H	5.32	12.72	18.48	24.83	16.39	76.59
Proprietary data - Gemini-2.5-Z	6.38	13.57	19.68	23.44	15.17	63.79
Proprietary data - Gemini-2.5-M	6.15	13.95	19.72	23.91	15.39	67.79
Proprietary data - Gemini-2.5-H	6.20	13.42	19.25	24.20	15.66	71.38
Proprietary data - Claude-3.5-Z	5.98	13.50	19.72	22.26	14.17	64.27
Proprietary data - Claude-3.5-M	6.55	14.58	20.97	21.93	13.85	57.45
Proprietary data - Claude-3.5-H	6.75	14.63	20.95	22.11	14.11	59.61
Proprietary data - ML Best	9.48	22.98	33.93	9.97	7.18	29.92
Proprietary data - Stat Best	4.42	13.15	20.32	22.46	14.25	55.41
Instacart data - GPT-4o-Z	6.54	15.46	22.16	30.11	16.13	77.39
Instacart data - GPT-4o-M	7.32	16.04	22.98	28.56	15.09	66.80
Instacart data - GPT-4o-H	6.00	14.12	20.12	31.05	17.01	84.03
Instacart data - Gemini-2.5-Z	7.30	15.76	22.82	28.85	15.27	64.13
Instacart data - Gemini-2.5-M	7.28	16.64	23.06	28.36	15.05	59.43
Instacart data - Gemini-2.5-H	6.26	14.80	21.36	29.46	16.17	74.38
Instacart data - Claude-3.5-Z	6.22	14.48	22.20	26.88	14.18	67.71
Instacart data - Claude-3.5-M	6.02	14.24	21.82	26.92	13.93	62.29
Instacart data - Claude-3.5-H	6.92	15.10	22.44	27.50	14.42	64.31
Instacart data - ML Best	8.46	22.62	33.42	9.17	6.55	35.04
Instacart data - Stat Best	5.90	15.34	23.00	27.97	14.52	56.34

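ML models outperform the LLMs on the standard error metrics (MAPE, RMSE, MAE).
On the proprietary data, the ML MAPE is 29.92% versus 57.45% for the best LLM (Claude-3.5-M).
On the same data, TA@1 is 22.98% for ML versus 14.63% for the best LLM (Claude-3.5-H).
The best LLM configurations beat the best statistical baseline on most metrics (e.g., RMSE 21.93 vs. 22.46 on the proprietary data), suggesting they use contextual cues beyond a simple median; MAPE is the exception, where the statistical baseline remains lower (55.41% vs. 57.45%).
Medium-context prompts improve LLM performance in most cases, though not uniformly (Gemini-2.5 on the proprietary data has higher error with medium context), whereas high-context prompts often reduce accuracy, suggesting that extra context can act as noise for temporal precision.
GPT-4o is the fastest and cheapest of the LLMs, Claude-3.5 is slower and more expensive, and Gemini-2.5 has the highest latency.
The results point to an impedance mismatch: LLMs are strong at qualitative reasoning but struggle with precise quantitative timing, which motivates hybrid, context-aware models.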
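To put the ML-versus-LLM gap in relative terms, the short calculation below reuses the proprietary-data values from the results table. The helper name `relative_reduction` is introduced only for this illustration. The figures it prints are derived here from the table and are not reported in the review.

```python
def relative_reduction(baseline, improved):
    """Percent by which `improved` is lower than `baseline`."""
    return 100 * (baseline - improved) / baseline


# Values copied from the results table (proprietary data).
ml_mape, best_llm_mape = 29.92, 57.45  # ML Best vs. Claude-3.5-M
ml_ta1, best_llm_ta1 = 22.98, 14.63    # ML Best vs. Claude-3.5-H

mape_cut = relative_reduction(best_llm_mape, ml_mape)
ta1_ratio = ml_ta1 / best_llm_ta1

print(f"ML MAPE is {mape_cut:.1f}% lower than the best LLM's")
print(f"ML TA@1 is {ta1_ratio:.2f}x the best LLM's")
```

On these numbers the best ML model roughly halves the best LLM's MAPE and reaches about one and a half times its TA@1.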

Figure 2. Prompt designs for three context levels: Zero (historical intervals only), Medium (product metadata, summary statistics), and High (recency features, user lifecycle information).
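The three context levels in the caption can be illustrated with a small prompt builder. This is a hypothetical sketch, not the prompts used in the paper. The wording, the field names (`category`, `days_since_last`, `tenure_days`), and the choice to make the high level include everything from the medium level are all assumptions.

```python
from statistics import mean, median


def build_prompt(intervals, level="zero", product=None, user=None):
    """Assemble a zero-shot prompt at one of three assumed context levels."""
    if level not in ("zero", "medium", "high"):
        raise ValueError(f"unknown context level: {level}")

    # Zero context: historical intervals only.
    history = ", ".join(str(days) for days in intervals)
    lines = [f"Past intervals between purchases (days): {history}"]

    # Medium context: product metadata and summary statistics.
    if level in ("medium", "high"):
        product = product or {}
        lines.append(f"Product category: {product.get('category', 'unknown')}")
        lines.append(
            f"Mean interval: {mean(intervals):.1f} days; "
            f"median interval: {median(intervals):.1f} days"
        )

    # High context: recency and user lifecycle fields (hypothetical names).
    if level == "high":
        user = user or {}
        lines.append(
            f"Days since last purchase: {user.get('days_since_last', 'unknown')}"
        )
        lines.append(f"Customer tenure (days): {user.get('tenure_days', 'unknown')}")

    lines.append("Predict the next interval in whole days. Answer with one integer.")
    return "\n".join(lines)


zero_prompt = build_prompt([5, 9, 7, 6])
medium_prompt = build_prompt([5, 9, 7, 6], "medium", product={"category": "milk"})
high_prompt = build_prompt(
    [5, 9, 7, 6],
    "high",
    product={"category": "milk"},
    user={"days_since_last": 4, "tenure_days": 210},
)
```

Keeping the interval history identical across levels means that any change in accuracy can be attributed to the added context rather than to a different view of the purchase history.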

This review was generated by AI and checked by a human editor.