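[Paper Digest] ProAgentBench: Evaluating LLM Agents for Proactive Assistance with Real-World Data
ProAgentBench introduces a real-world, privacy-preserving dataset and a two-stage framework for evaluating whether proactive AI agents can decide when to help and how to help. Its experiments show that real-world data and long-term context improve performance.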
Proactive agents that anticipate user intentions without explicit prompts represent a significant evolution in human-AI interaction, promising to reduce cognitive load and streamline workflows. However, existing datasets suffer from two critical deficiencies: (1) reliance on LLM-synthesized data that fails to capture authentic human decision-making patterns, and (2) a focus on isolated tasks rather than continuous workflows, which misses the pre-assistance behavioral context essential for learning proactive intervention signals. To address these gaps, we introduce ProAgentBench, a rigorous benchmark for proactive agents in working scenarios. Our contributions include: (1) a hierarchical task framework that decomposes proactive assistance into timing prediction and assist content generation; (2) a privacy-compliant dataset with 28,000+ events from 500+ hours of real user sessions, preserving bursty interaction patterns (burstiness B=0.787) absent in synthetic data; and (3) extensive experiments that evaluate LLM- and VLM-based baselines. Our results show that long-term memory and historical context significantly enhance prediction accuracy, while real-world training data substantially outperforms synthetic alternatives. We release our dataset and code at https://anonymous.4open.science/r/ProAgentBench-6BC0.
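The abstract reports a burstiness of B=0.787 but does not spell out the formula. A common definition, assumed here and not confirmed by the source, is the Goh-Barabási burstiness parameter B = (σ − μ) / (σ + μ), where μ and σ are the mean and standard deviation of the gaps between consecutive events. Under that assumption, B is −1 for perfectly regular events, near 0 for a Poisson process, and approaches 1 for highly bursty activity. A minimal sketch (the function name and the toy timestamps are illustrative, not from the paper):

```python
import statistics


def burstiness(timestamps):
    """Burstiness B = (sigma - mu) / (sigma + mu) of inter-event times.

    Assumed definition (Goh-Barabasi); the paper may compute B differently.
    """
    ts = sorted(timestamps)
    # Gaps between consecutive events.
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events to estimate burstiness")
    mu = statistics.fmean(gaps)
    sigma = statistics.pstdev(gaps)
    if mu + sigma == 0:
        raise ValueError("all events share one timestamp; B is undefined")
    return (sigma - mu) / (sigma + mu)


# Evenly spaced events have zero variance in their gaps, so B is -1.
regular = [0, 1, 2, 3, 4]
# Short bursts separated by long pauses push B above 0.
bursty = [0, 1, 2, 3, 100, 101, 102, 200]
```

Calling `burstiness(regular)` gives −1.0, while `burstiness(bursty)` is positive, which is the direction the paper attributes to real user sessions.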
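Research Motivation and Objectives
- Establish a benchmark for proactive agents in real-world workflow scenarios.
- Capture authentic, long-term user interaction data that includes the context preceding each assist.
- Propose a two-stage framework (When to Assist, How to Assist) for systematic evaluation.
- Quantify how real-world data and long-term memory affect proactive intervention.
- Provide baselines across LLMs and VLMs as a guide for future research.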
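Proposed Method
- Propose a hierarchical "When + How" framework that decomposes proactive assistance into timing prediction and content generation.
- Assemble a privacy-compliant dataset of 28,528 events from 500+ hours of real user sessions, preserving bursty interaction patterns.
- Use a data collection pipeline with anonymization, human-in-the-loop review, and LLM-based automatic event annotation.
- Evaluate a diverse set of LLM and VLM baselines, including prompt-based methods (Zero-shot, CoT, Self-Consistency) and memory-based methods (RAG, Knowledge Graph, Clustering).
- Analyze how the length of the observed history and long-term user context affect the prediction and generation tasks.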
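The two-stage decomposition above can be sketched as a small pipeline: first decide whether to assist given the recent history, and only then generate the assist. This is a sketch under stated assumptions, not the paper's implementation. The `Event` and `Assist` types, the function names, and the toy rule-based stand-ins for the models are all hypothetical, and the 300-second default window is only an illustrative choice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    timestamp: float  # seconds since session start (hypothetical schema)
    description: str


@dataclass
class Assist:
    intention: str
    content: str


def recent_history(events: List[Event], now: float,
                   window_s: float = 300.0) -> List[Event]:
    """Keep only events observed within the last `window_s` seconds."""
    return [e for e in events if 0 <= now - e.timestamp <= window_s]


def proactive_step(events: List[Event], now: float,
                   should_assist: Callable[[List[Event]], bool],
                   generate_assist: Callable[[List[Event]], Assist],
                   window_s: float = 300.0) -> Optional[Assist]:
    """Stage 1 (When to Assist) gates Stage 2 (How to Assist)."""
    history = recent_history(events, now, window_s)
    if not should_assist(history):
        return None  # stay silent: no intervention at this moment
    return generate_assist(history)


# Toy stand-ins for the LLM/VLM calls, for illustration only.
def keyword_timing(history: List[Event]) -> bool:
    return any("error" in e.description.lower() for e in history)


def template_assist(history: List[Event]) -> Assist:
    return Assist(intention="debugging",
                  content=f"Suggest a fix for: {history[-1].description}")
```

In a real system, `should_assist` and `generate_assist` would wrap model calls. Keeping them as plain callables lets the same loop compare prompting or memory strategies without changing the pipeline.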

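Experimental Results
Research Questions
- RQ1: How does the length of the historical observation sequence affect performance on When to Assist and How to Assist?
- RQ2: What effect does incorporating long-term user context (memory) have on proactive assistance, and which memory strategy is most effective?
- RQ3: Does real-world training data outperform synthetic data for fine-tuning proactive agents?
- RQ4: To what extent do prompting strategies (Zero-shot, CoT, Self-Consistency) help or hinder different models?
- RQ5: Which practical metrics reflect the real-world productivity impact of proactive intervention?
Key Findings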
| Model | Method | When to Assist Accuracy | When to Assist Precision | When to Assist Recall | When to Assist F1 Score | How to Assist Intention Accuracy | How to Assist Semantic Similarity |
|---|---|---|---|---|---|---|---|
| GPT-4o-mini | Zero-shot | 54.9% | 52.7% | 96.2% | 68.1% | 28.4% | 0.280 |
| GPT-4o-mini | CoT | 55.7% | 55.6% | 99.5% | 71.3% | 30.5% | 0.298 |
| GPT-4o-mini | Self-Consistency | 55.2% | 52.8% | 96.0% | 68.2% | 28.2% | 0.280 |
| Qwen3-Max | Zero-shot | 59.3% | 55.5% | 93.4% | 69.7% | 36.3% | 0.285 |
| Qwen3-Max | CoT | 59.8% | 59.6% | 72.5% | 65.4% | 38.2% | 0.305 |
| Qwen3-Max | Self-Consistency | 59.5% | 55.7% | 93.5% | 69.9% | 36.2% | 0.285 |
| Deepseek-V3.2 | Zero-shot | 64.4% | 60.8% | 81.1% | 69.5% | 36.5% | 0.276 |
| Deepseek-V3.2 | CoT | 61.1% | 60.9% | 86.6% | 71.3% | 35.0% | 0.287 |
| Deepseek-V3.2 | Self-Consistency | 64.4% | 60.8% | 81.1% | 69.6% | 36.5% | 0.276 |
| Qwen3-VL-Plus | Zero-shot | 53.0% | 51.6% | 97.0% | 67.4% | 37.1% | 0.286 |
| Qwen3-VL-Plus | CoT | 53.5% | 54.9% | 61.3% | 57.9% | 34.4% | 0.305 |
| Qwen3-VL-Plus | Self-Consistency | 53.1% | 51.7% | 97.0% | 67.4% | 36.7% | 0.286 |
| Llama-3.1-8B-Instruct | Zero-shot | 57.3% | 54.7% | 85.7% | 66.7% | 32.3% | 0.275 |
| Llama-3.1-8B-Instruct | CoT | 50.8% | 50.4% | 99.0% | 66.8% | 29.1% | 0.294 |
| Llama-3.1-8B-Instruct | Self-Consistency | 58.8% | 56.7% | 85.3% | 68.1% | 32.5% | 0.274 |
| Qwen3-VL-8B-Instruct | Zero-shot | 51.7% | 50.9% | 94.4% | 66.1% | 35.3% | 0.276 |
| Qwen3-VL-8B-Instruct | CoT | 41.0% | 32.7% | 17.1% | 22.4% | 34.1% | 0.277 |
| Qwen3-VL-8B-Instruct | Self-Consistency | 52.9% | 51.8% | 93.6% | 66.7% | 35.7% | 0.274 |
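- Longer historical context improves both timing and intention prediction accuracy, with diminishing returns after roughly 5 minutes.
- Knowledge Graph-based long-term memory yields the largest improvement over the zero-shot baseline (Accuracy +11.8%, Intention Accuracy +26.9%, F1 +6.1%).
- Real-world training data significantly outperforms synthetic data when fine-tuning a range of models.
- Prompting strategies give mixed results: chain-of-thought (CoT) helps larger models but hurts smaller and open-source models, and Self-Consistency yields only limited gains.
- Models still score low on semantic similarity in the How to Assist task, indicating that content-generation quality has room to improve.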
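As a sanity check on how to read the When to Assist columns, the F1 scores in the table above can be recomputed from the reported precision and recall using the standard formula F1 = 2PR / (P + R). The sketch below does this for three rows copied from the table. Only the helper name and the dictionary layout are my own, and small differences are expected because the table values are rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the results table.
rows = {
    ("GPT-4o-mini", "Zero-shot"): (0.527, 0.962, 0.681),
    ("Deepseek-V3.2", "Zero-shot"): (0.608, 0.811, 0.695),
    ("Qwen3-Max", "CoT"): (0.596, 0.725, 0.654),
}

for (model, method), (p, r, reported) in rows.items():
    recomputed = f1_score(p, r)
    print(f"{model} / {method}: recomputed {recomputed:.3f}, reported {reported:.3f}")
```

The recomputed values agree with the reported ones to within rounding. The pairing of very high recall with precision and accuracy only slightly above 50% in several rows suggests that those models trigger assistance on most inputs, which may explain why accuracy stays low even where F1 looks moderate.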

This digest was generated by AI and reviewed by human editors.