QUICK REVIEW

[Paper Review] Is More Context Always Better? Examining LLM Reasoning Capability for Time Interval Prediction

Yanan Cao, Farnaz Fallahi|arXiv (Cornell University)|2026. 01. 15.

Topic Modeling | Citations: 0

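One-Line Summary

This paper benchmarks zero-shot LLMs against statistical and ML baselines for predicting intervals between intermittent purchases, and shows that ML models outperform LLMs, that moderate context improves LLM performance, and that excessive context hurts it.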

ABSTRACT

Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning and prediction across different domains. Yet, their ability to infer temporal regularities from structured behavioral data remains underexplored. This paper presents a systematic study investigating whether LLMs can predict time intervals between recurring user actions, such as repeated purchases, and how different levels of contextual information shape their predictive behavior. Using a simple but representative repurchase scenario, we benchmark state-of-the-art LLMs in zero-shot settings against both statistical and machine-learning models. Two key findings emerge. First, while LLMs surpass lightweight statistical baselines, they consistently underperform dedicated machine-learning models, showing their limited ability to capture quantitative temporal structure. Second, although moderate context can improve LLM accuracy, adding further user-level detail degrades performance. These results challenge the assumption that "more context leads to better reasoning". Our study highlights fundamental limitations of today's LLMs in structured temporal inference and offers guidance for designing future context-aware hybrid models that integrate statistical precision with linguistic flexibility.

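Research Motivation and Goals

Motivate and define the problem of predicting the time interval between recurring user actions in web behavior.
Systematically compare LLMs in a zero-shot setting against statistical and machine-learning baselines.
Evaluate how different levels of contextual information affect LLM reasoning on temporal tasks.
Highlight design implications for hybrid models that combine statistical precision with linguistic flexibility.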

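Proposed Method

Evaluate state-of-the-art LLMs (GPT-4o, Gemini-2.5, Claude-3.5) in a zero-shot setting at three prompt levels (zero, medium, and high context).
Benchmark traditional ML models (RandomForest, XGBoost, MLP) trained with a quantile loss on structured features.
Include lightweight statistical estimators (mean, median, EMA) as additional baselines.
Use two real-world datasets after preprocessing (sequences with five or fewer purchases excluded; intervals capped at 20 days).
Evaluate performance with regression metrics (RMSE, MAE, MAPE) and the business-oriented TA@k metrics (TA@0, TA@1, TA@2).
Report deployment-related metrics (latency and cost per prediction).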
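The evaluation metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The review does not define TA@k, so the sketch assumes it is the percentage of predictions whose rounded value falls within k days of the true interval. The quantile loss is shown in its standard pinball form, which may differ in detail from the paper's training objective. All function names and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, MAPE in percent) for paired intervals in days."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / t for e, t in zip(errors, y_true)) / n
    return rmse, mae, mape


def ta_at_k(y_true, y_pred, k):
    """ASSUMED definition: percent of rounded predictions within k days of the truth."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(round(p) - t) <= k)
    return 100.0 * hits / len(y_true)


def pinball_loss(y_true, y_pred, q=0.5):
    """Standard pinball (quantile) loss averaged over samples; q is the target quantile."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


# Hypothetical true and predicted intervals (days), for illustration only.
true_days = [5, 9, 7, 6]
pred_days = [6, 7, 7, 9]

rmse, mae, mape = regression_metrics(true_days, pred_days)
ta0, ta1, ta2 = (ta_at_k(true_days, pred_days, k) for k in (0, 1, 2))
median_loss = pinball_loss(true_days, pred_days, q=0.5)
```

Under this assumed definition TA@k is non-decreasing in k, which matches the pattern in the results table (TA@0 < TA@1 < TA@2 in every row).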

Figure 1. Illustration of the interval-prediction task and the three prompting conditions. The example is about repeated milk purchases with varying intervals (5 days → 9 days → 7 days → 6 days). The model observes historical intervals for a product category and predicts the next interval under zero
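To make the lightweight statistical estimators (mean, median, EMA) concrete, the sketch below applies them to the milk example from Figure 1 (intervals of 5, 9, 7, and 6 days). The EMA smoothing factor `alpha=0.5` and the choice to seed the EMA with the first interval are assumptions for illustration; the review does not state the settings the paper used.

```python
from statistics import mean, median


def ema(intervals, alpha=0.5):
    """Exponential moving average of the intervals.

    alpha is an assumed smoothing factor; the estimate is seeded
    with the first observed interval.
    """
    estimate = float(intervals[0])
    for x in intervals[1:]:
        estimate = alpha * x + (1.0 - alpha) * estimate
    return estimate


# Historical intervals (days) from the Figure 1 milk example.
history = [5, 9, 7, 6]

predictions = {
    "mean": mean(history),      # 6.75
    "median": median(history),  # 6.5
    "ema": ema(history),        # 6.5 with alpha=0.5
}
```

Each estimator returns a single number that serves as the predicted next interval, which is what makes these baselines cheap to run per user and product.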

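Experimental Results

Research Questions

RQ1: Can LLMs outperform traditional machine-learning models at predicting inter-purchase intervals?
RQ2: Does providing richer contextual information improve LLM performance on time-interval reasoning tasks?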

Key Results

Model	TA@0	TA@1	TA@2	RMSE	MAE	MAPE
Proprietary data - GPT-4o-Z	5.75	12.72	18.65	23.66	15.50	73.03
Proprietary data - GPT-4o-M	6.13	13.83	19.92	22.95	14.76	66.78
Proprietary data - GPT-4o-H	5.32	12.72	18.48	24.83	16.39	76.59
Proprietary data - Gemini-2.5-Z	6.38	13.57	19.68	23.44	15.17	63.79
Proprietary data - Gemini-2.5-M	6.15	13.95	19.72	23.91	15.39	67.79
Proprietary data - Gemini-2.5-H	6.20	13.42	19.25	24.20	15.66	71.38
Proprietary data - Claude-3.5-Z	5.98	13.50	19.72	22.26	14.17	64.27
Proprietary data - Claude-3.5-M	6.55	14.58	20.97	21.93	13.85	57.45
Proprietary data - Claude-3.5-H	6.75	14.63	20.95	22.11	14.11	59.61
Proprietary data - ML Best	9.48	22.98	33.93	9.97	7.18	29.92
Proprietary data - Stat Best	4.42	13.15	20.32	22.46	14.25	55.41
Instacart data - GPT-4o-Z	6.54	15.46	22.16	30.11	16.13	77.39
Instacart data - GPT-4o-M	7.32	16.04	22.98	28.56	15.09	66.80
Instacart data - GPT-4o-H	6.00	14.12	20.12	31.05	17.01	84.03
Instacart data - Gemini-2.5-Z	7.30	15.76	22.82	28.85	15.27	64.13
Instacart data - Gemini-2.5-M	7.28	16.64	23.06	28.36	15.05	59.43
Instacart data - Gemini-2.5-H	6.26	14.80	21.36	29.46	16.17	74.38
Instacart data - Claude-3.5-Z	6.22	14.48	22.20	26.88	14.18	67.71
Instacart data - Claude-3.5-M	6.02	14.24	21.82	26.92	13.93	62.29
Instacart data - Claude-3.5-H	6.92	15.10	22.44	27.50	14.42	64.31
Instacart data - ML Best	8.46	22.62	33.42	9.17	6.55	35.04
Instacart data - Stat Best	5.90	15.34	23.00	27.97	14.52	56.34

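ML models outperform the LLMs on the standard error metrics (MAPE, RMSE, MAE) on both datasets.
On the proprietary data, the best ML model reaches a MAPE of 29.92%, while the best LLM configuration reaches 57.45% (Claude-3.5-M).
On the same data, the best ML model reaches a TA@1 of 22.98%, while the best LLM configuration reaches 14.63% (Claude-3.5-H).
LLMs beat the best statistical baseline on exact-match accuracy (TA@0) on both datasets, which the review attributes to their use of contextual signals, but the margin is small and the best statistical baseline still has a lower MAPE than every LLM configuration in the table.
Medium-context prompts often improve LLM performance (for example, GPT-4o's MAPE on the proprietary data falls from 73.03% to 66.78%), while high-context prompts often degrade accuracy (GPT-4o-H: 76.59%), suggesting that extra context can act as noise for temporal precision.
GPT-4o is the fastest and cheapest of the LLMs; Claude-3.5 is slower and more expensive; Gemini-2.5 shows the highest latency.
The results reveal an impedance mismatch: LLMs excel at qualitative reasoning but struggle with precise quantitative timing, which points to the need for hybrid, context-aware models.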
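The size of the gap between the best ML model and the best LLM can be worked out directly from the table. The short sketch below uses only the proprietary-data values reported above; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Percent by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Proprietary data, values copied from the results table.
best_llm_mape = 57.45  # Claude-3.5-M
best_ml_mape = 29.92   # ML Best
best_llm_ta1 = 14.63   # Claude-3.5-H
best_ml_ta1 = 22.98    # ML Best

# Share of the best LLM's MAPE that the ML model removes (roughly 48%).
mape_cut = relative_reduction(best_llm_mape, best_ml_mape)

# How many times higher the ML model's TA@1 is (roughly 1.57x).
ta1_ratio = best_ml_ta1 / best_llm_ta1
```

In other words, on this dataset the dedicated ML model roughly halves the percentage error of the strongest LLM configuration.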

Figure 2. Prompt designs for three context levels: Zero (historical intervals only), Medium (product metadata, summary statistics), and High (recency features, user lifecycle information).
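The three context levels in Figure 2 could be assembled along the following lines. This is a hedged sketch only: the prompt wording, the function name `build_prompt`, and the parameter names are all hypothetical, and it assumes the levels are cumulative (High includes everything in Medium). The paper's actual prompts are not reproduced in this review.

```python
from statistics import mean, median

LEVELS = ("zero", "medium", "high")


def build_prompt(intervals, level="zero", product=None,
                 days_since_last=None, customer_age_days=None):
    """Build an illustrative prompt for one of three assumed context levels."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}")

    # Zero context: historical intervals only.
    history = ", ".join(str(x) for x in intervals)
    lines = [f"Past purchase intervals in days: {history}."]

    # Medium context: product metadata and summary statistics.
    if level in ("medium", "high"):
        lines.append(f"Product category: {product}.")
        lines.append(
            f"Mean interval: {mean(intervals):.2f} days; "
            f"median interval: {median(intervals):.1f} days."
        )

    # High context: recency features and user lifecycle information.
    if level == "high":
        lines.append(f"Days since last purchase: {days_since_last}.")
        lines.append(f"Customer account age in days: {customer_age_days}.")

    lines.append("Predict the next interval as a single integer number of days.")
    return "\n".join(lines)


zero_prompt = build_prompt([5, 9, 7, 6])
medium_prompt = build_prompt([5, 9, 7, 6], level="medium", product="milk")
high_prompt = build_prompt([5, 9, 7, 6], level="high", product="milk",
                           days_since_last=3, customer_age_days=400)
```

Keeping the levels cumulative makes it easy to attribute any accuracy change between Medium and High to the added user-level fields alone.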


This review was generated by AI and reviewed by a human editor.