QUICK REVIEW

[Paper Review] ProAgentBench: Evaluating LLM Agents for Proactive Assistance with Real-World Data

Yuanbo Tang, Huaze Tang|arXiv (Cornell University)|Feb 4, 2026
Personal Information Management and User Behavior | Cited by 0
One-Sentence Summary

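ProAgentBench introduces a real-world, privacy-preserving dataset and a two-stage framework for evaluating when and how proactive AI agents should offer assistance, and shows that real-world data and long-term context improve performance.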

ABSTRACT

Proactive agents that anticipate user intentions without explicit prompts represent a significant evolution in human-AI interaction, promising to reduce cognitive load and streamline workflows. However, existing datasets suffer from two critical deficiencies: (1) reliance on LLM-synthesized data that fails to capture authentic human decision-making patterns, and (2) focus on isolated tasks rather than continuous workflows, missing the pre-assistance behavioral context essential for learning proactive intervention signals. To address these gaps, we introduce ProAgentBench, a rigorous benchmark for proactive agents in working scenarios. Our contributions include: (1) a hierarchical task framework that decomposes proactive assistance into timing prediction and assist content generation; (2) a privacy-compliant dataset with 28,000+ events from 500+ hours of real user sessions, preserving bursty interaction patterns (burstiness B=0.787) absent in synthetic data; and (3) extensive experiments that evaluate LLM- and VLM-based baselines. Numerically, we show that long-term memory and historical context significantly enhance prediction accuracy, while real-world training data substantially outperforms synthetic alternatives. We release our dataset and code at https://anonymous.4open.science/r/ProAgentBench-6BC0.
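
The abstract reports a burstiness value (B=0.787) without restating the formula. A minimal sketch follows, assuming the commonly used burstiness parameter B = (σ − μ) / (σ + μ) computed over inter-event times, where μ and σ are their mean and standard deviation. Whether the paper uses exactly this definition is an assumption, and the function name and sample timestamps are hypothetical.

```python
import statistics


def burstiness(event_times):
    """Burstiness B = (sigma - mu) / (sigma + mu) over inter-event gaps.

    B is -1 for perfectly regular events, near 0 for Poisson-like
    arrivals, and approaches +1 for highly bursty activity.
    """
    times = sorted(event_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)  # population standard deviation
    if mu + sigma == 0:
        raise ValueError("all events share the same timestamp")
    return (sigma - mu) / (sigma + mu)


# Hypothetical timestamps in seconds (not taken from the dataset).
regular = [0, 10, 20, 30, 40, 50]
bursty = [0, 1, 2, 3, 100, 101, 102, 103, 500]

print(burstiness(regular))  # -1.0: evenly spaced events
print(burstiness(bursty))   # positive: short bursts separated by long pauses
```

Under this definition, a value such as 0.787 would indicate activity concentrated in short bursts separated by long idle periods.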

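Motivation and Objectives

  • Establish a benchmark for proactive agents in real workflow scenarios.
  • Capture authentic, long-term user interaction data that includes pre-assistance context.
  • Propose a two-stage framework (When to Assist, How to Assist) for systematic evaluation.
  • Quantify the effect of real-world data and long-term memory on proactive intervention.
  • Provide baselines across LLMs and VLMs to guide future research.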

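Proposed Method

  • Propose a hierarchical "When + How" framework that decomposes proactive assistance into timing prediction and content generation.
  • Assemble a privacy-compliant dataset of 28,528 events from 500+ hours of real user sessions, preserving bursty interaction patterns.
  • Use a data collection pipeline with anonymization, human-in-the-loop review, and LLM-based automatic event annotation.
  • Evaluate a diverse set of LLM and VLM baselines, including prompt-based methods (Zero-shot, CoT, Self-Consistency) and memory-based methods (RAG, Knowledge Graph, Clustering).
  • Analyze how historical observation length and long-term user context affect the prediction and generation tasks.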
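
The two-stage "When + How" decomposition can be sketched as a gate followed by a generator. This is only an illustration under stated assumptions, not the benchmark's actual interface: the `Event` and `Assistance` types, the `proactive_step` function, the 300-second window, and both stub models are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Event:
    timestamp: float   # seconds since session start
    description: str   # e.g. an anonymized summary of screen activity


@dataclass
class Assistance:
    intention: str     # predicted user intention
    content: str       # suggested help text


def proactive_step(
    history: Sequence[Event],
    when_to_assist: Callable[[Sequence[Event]], bool],
    how_to_assist: Callable[[Sequence[Event]], Assistance],
    window_seconds: float = 300.0,  # assumed observation window
) -> Optional[Assistance]:
    """Stage 1 decides whether to intervene; stage 2 runs only if it does."""
    if not history:
        return None
    cutoff = history[-1].timestamp - window_seconds
    recent = [e for e in history if e.timestamp >= cutoff]
    if not when_to_assist(recent):
        return None
    return how_to_assist(recent)


# Stub models standing in for LLM or VLM calls (assumptions for illustration).
def stub_timing(recent: Sequence[Event]) -> bool:
    return "error" in recent[-1].description.lower()


def stub_content(recent: Sequence[Event]) -> Assistance:
    return Assistance("debug", f"Help with: {recent[-1].description}")


session = [Event(0.0, "Opened editor"), Event(42.0, "Build error in terminal")]
print(proactive_step(session, stub_timing, stub_content))
print(proactive_step(session[:1], stub_timing, stub_content))  # None
```

Keeping the two stages as separate callables lets timing prediction and content generation be scored independently, which mirrors how the benchmark reports separate metrics for each.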
Figure 1: Illustration of Proactive Agent Workflow. The agent continuously monitors user screen activities and contextual signals. When assistance is needed, it proactively determines when to intervene and how to assist based on historical observations and user behavior patterns.

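Experimental Results

Research Questions

  • RQ1: How does the length of the historical observation sequence affect When to Assist and How to Assist performance?
  • RQ2: What is the effect of incorporating long-term user context (memory) on proactive assistance, and which memory strategy is most effective?
  • RQ3: Does real-world training data outperform synthetic data for fine-tuning proactive agents?
  • RQ4: To what extent do prompting strategies (Zero-shot, CoT, Self-Consistency) help or hinder different models?
  • RQ5: Which practical metrics reflect the real-world productivity impact of proactive intervention?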

Key Findings

| Model | Method | When to Assist: Accuracy | When to Assist: Precision | When to Assist: Recall | When to Assist: F1 Score | How to Assist: Intention Acc. | How to Assist: Sem. Sim. |
|---|---|---|---|---|---|---|---|
| GPT-4o-mini | Zero-shot | 54.9% | 52.7% | 96.2% | 68.1% | 28.4% | 0.280 |
| GPT-4o-mini | CoT | 55.7% | 55.6% | 99.5% | 71.3% | 30.5% | 0.298 |
| GPT-4o-mini | Self-Consistency | 55.2% | 52.8% | 96.0% | 68.2% | 28.2% | 0.280 |
| Qwen3-Max | Zero-shot | 59.3% | 55.5% | 93.4% | 69.7% | 36.3% | 0.285 |
| Qwen3-Max | CoT | 59.8% | 59.6% | 72.5% | 65.4% | 38.2% | 0.305 |
| Qwen3-Max | Self-Consistency | 59.5% | 55.7% | 93.5% | 69.9% | 36.2% | 0.285 |
| Deepseek-V3.2 | Zero-shot | 64.4% | 60.8% | 81.1% | 69.5% | 36.5% | 0.276 |
| Deepseek-V3.2 | CoT | 61.1% | 60.9% | 86.6% | 71.3% | 35.0% | 0.287 |
| Deepseek-V3.2 | Self-Consistency | 64.4% | 60.8% | 81.1% | 69.6% | 36.5% | 0.276 |
| Qwen3-VL-Plus | Zero-shot | 53.0% | 51.6% | 97.0% | 67.4% | 37.1% | 0.286 |
| Qwen3-VL-Plus | CoT | 53.5% | 54.9% | 61.3% | 57.9% | 34.4% | 0.305 |
| Qwen3-VL-Plus | Self-Consistency | 53.1% | 51.7% | 97.0% | 67.4% | 36.7% | 0.286 |
| Llama-3.1-8B-Instruct | Zero-shot | 57.3% | 54.7% | 85.7% | 66.7% | 32.3% | 0.275 |
| Llama-3.1-8B-Instruct | CoT | 50.8% | 50.4% | 99.0% | 66.8% | 29.1% | 0.294 |
| Llama-3.1-8B-Instruct | Self-Consistency | 58.8% | 56.7% | 85.3% | 68.1% | 32.5% | 0.274 |
| Qwen3-VL-8B-Instruct | Zero-shot | 51.7% | 50.9% | 94.4% | 66.1% | 35.3% | 0.276 |
| Qwen3-VL-8B-Instruct | CoT | 41.0% | 32.7% | 17.1% | 22.4% | 34.1% | 0.277 |
| Qwen3-VL-8B-Instruct | Self-Consistency | 52.9% | 51.8% | 93.6% | 66.7% | 35.7% | 0.274 |
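
The four When to Assist columns are standard binary classification metrics. The sketch below shows how they relate, assuming "assist" is the positive class. The helper names and toy labels are hypothetical; only the precision and recall values in the final check come from the table (GPT-4o-mini, CoT).

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with 'assist' (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = assistance needed, 0 = not needed.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
print(binary_metrics(y_true, y_pred))

# Consistency check against one table row (GPT-4o-mini, CoT), in percent.
print(round(f1_score(55.6, 99.5), 1))  # 71.3
```

Read this way, a row with recall near 100% and precision and accuracy near 50 to 55% describes a model that predicts "assist" for almost every input, so its F1 score can look reasonable even though it rarely abstains.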
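  • Longer historical context improves the accuracy of both timing and intention prediction, with diminishing returns after roughly 5 minutes.
  • Knowledge Graph-based long-term memory yields the largest improvement over the zero-shot baseline (Accuracy +11.8%, Intention Accuracy +26.9%, F1 +6.1%).
  • Real-world training data significantly outperforms synthetic data when fine-tuning a range of models.
  • The effect of prompting strategies is mixed: chain-of-thought reasoning helps larger models but hurts smaller/open-source models, and Self-Consistency brings only limited gains.
  • Models still score low on semantic similarity in the How to Assist task, indicating room for improvement in the quality of generated content.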
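
For readers unfamiliar with Self-Consistency, the idea is to sample several answers from the same model and keep the majority vote. A minimal sketch for the binary timing decision follows. The function name, the sample count, and the canned sampler are assumptions for illustration, and the paper's exact sampling setup is not described here.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def self_consistent_decision(
    sample_once: Callable[[], Hashable],
    n_samples: int = 5,  # odd count avoids ties for binary answers
) -> Tuple[Hashable, float]:
    """Draw n_samples answers and return the majority answer with its vote share."""
    votes = Counter(sample_once() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples


# Canned sampler standing in for repeated stochastic model calls.
canned = iter([True, False, True, True, False])
decision, share = self_consistent_decision(lambda: next(canned))
print(decision, share)  # True 0.6
```

If the sampled answers rarely disagree, the vote returns the same decision as a single call, which is one plausible reason (not confirmed by the source) for Self-Consistency scores sitting so close to the Zero-shot rows in the table.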
(a) Weekday distribution.

This review was generated by AI and reviewed by human editors.