QUICK REVIEW

[Paper Review] ProAgentBench: Evaluating LLM Agents for Proactive Assistance with Real-World Data

Yuanbo Tang, Huaze Tang|arXiv (Cornell University)|Feb 4, 2026

Personal Information Management and User Behavior | Citations: 0

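One-sentence summary

ProAgentBench introduces a privacy-preserving dataset of real-world usage and a two-stage framework that evaluates when to assist and how to assist, and it shows that real data and long-term context improve performance.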

ABSTRACT

Proactive agents that anticipate user intentions without explicit prompts represent a significant evolution in human-AI interaction, promising to reduce cognitive load and streamline workflows. However, existing datasets suffer from two critical deficiencies: (1) reliance on LLM-synthesized data that fails to capture authentic human decision-making patterns, and (2) focus on isolated tasks rather than continuous workflows, missing the pre-assistance behavioral context essential for learning proactive intervention signals. To address these gaps, we introduce ProAgentBench, a rigorous benchmark for proactive agents in working scenarios. Our contributions include: (1) a hierarchical task framework that decomposes proactive assistance into timing prediction and assist content generation; (2) a privacy-compliant dataset with 28,000+ events from 500+ hours of real user sessions, preserving bursty interaction patterns (burstiness B=0.787) absent in synthetic data; and (3) extensive experiments that evaluate LLM- and VLM-based baselines. Numerically, we show that long-term memory and historical context significantly enhance prediction accuracy, while real-world training data substantially outperforms synthetic alternatives. We release our dataset and code at https://anonymous.4open.science/r/ProAgentBench-6BC0.

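Motivation and objectives

Create a benchmark for proactive agents in real-world workflow scenarios.
Capture authentic long-term user interaction data together with the context that precedes assistance.
Formulate a two-stage framework (When to Assist, How to Assist) for systematic evaluation.
Quantify how real-world data and long-term memory affect proactive intervention.
Provide LLM and VLM baselines to guide future research.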

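Proposed method

Develop a hierarchical "When + How" framework that decomposes proactive assistance into timing prediction and content generation.
Build a privacy-compliant dataset of 28,528 events from more than 500 hours of real user sessions while preserving bursty interaction patterns.
Use a data collection pipeline that includes anonymization, human-in-the-loop review, and automatic LLM-based event annotation.
Evaluate a range of LLM and VLM baselines, including prompt-based methods (Zero-shot, CoT, Self-Consistency) and memory-based methods (RAG, Knowledge Graph, Clustering).
Analyze how the length of the observed history and long-term user context affect the prediction and generation tasks.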
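The abstract quotes a burstiness value (B = 0.787) for the dataset, but this review does not say how it is computed. The sketch below assumes the commonly used coefficient B = (sigma - mu) / (sigma + mu) over inter-event gaps; that choice, the function name, and the sample timestamps are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def burstiness(timestamps):
    """Burstiness B = (sigma - mu) / (sigma + mu) of inter-event gaps.

    ASSUMPTION: this is a widely used definition; the review does not
    state which formula the paper applies.
    B is near -1 for perfectly regular events and approaches +1 for
    highly clustered ones.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mu, sigma = mean(gaps), pstdev(gaps)
    if mu + sigma == 0:
        raise ValueError("all events share one timestamp")
    return (sigma - mu) / (sigma + mu)

# Hypothetical timestamps in seconds, for illustration only.
regular = [0, 10, 20, 30, 40]                 # evenly spaced events
bursty = [0, 1, 2, 3, 100, 101, 102, 103]     # two clusters with a long pause

print(burstiness(regular))  # all gaps equal, so sigma is 0
print(burstiness(bursty))   # positive: gaps vary far more than their mean
```

Under this definition, evenly spaced events score -1 and clustered activity scores above 0, which is the sense in which a value of 0.787 would indicate strongly bursty sessions.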

Figure 1 : Illustration of Proactive Agent Workflow. The agent continuously monitors user screen activities and contextual signals. When assistance is needed, it proactively determines when to intervene and how to assist based on historical observations and user behavior patterns.
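The workflow in Figure 1 can be read as a two-stage loop: decide whether to intervene, then decide what to offer. The following is a minimal sketch of that reading, not the benchmark's implementation. The `Event` type, the function names, the 300-second default window, and the toy rules standing in for the models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Event:
    timestamp: float   # seconds since session start (hypothetical unit)
    description: str   # e.g. an annotated screen activity

def recent_history(events: List[Event], now: float, window_s: float = 300.0) -> List[Event]:
    """Keep only events from the last `window_s` seconds."""
    return [e for e in events if now - window_s <= e.timestamp <= now]

def proactive_step(
    events: List[Event],
    now: float,
    should_assist: Callable[[List[Event]], bool],    # stage 1: When to Assist
    generate_assist: Callable[[List[Event]], str],   # stage 2: How to Assist
    window_s: float = 300.0,
) -> Optional[str]:
    """Return assistance text, or None when the agent should stay silent."""
    history = recent_history(events, now, window_s)
    if not should_assist(history):
        return None
    return generate_assist(history)

# Toy stand-ins for the two models (assumptions for illustration only).
def toy_should_assist(history: List[Event]) -> bool:
    return any("error" in e.description for e in history)

def toy_generate_assist(history: List[Event]) -> str:
    return f"Suggest a fix for: {history[-1].description}"

events = [Event(0.0, "open editor"), Event(400.0, "compile error")]
print(proactive_step(events, 100.0, toy_should_assist, toy_generate_assist))
print(proactive_step(events, 420.0, toy_should_assist, toy_generate_assist))
```

Keeping the two stages as separate callables mirrors the benchmark's split, so each stage can be scored on its own metrics.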

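Experimental results

Research questions

RQ1: How does the length of the historical observation sequence affect When to Assist and How to Assist performance?
RQ2: How does incorporating long-term user context (memory) affect proactive assistance, and which memory strategy is most effective?
RQ3: Does real-world training data outperform synthetic data when fine-tuning proactive agents?
RQ4: Do prompting strategies (Zero-shot, CoT, Self-Consistency) help or hinder performance across models?
RQ5: Which practical metrics reflect real-world productivity in proactive intervention?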

Key findings

Model	Method	When to Assist Accuracy	When to Assist Precision	When to Assist Recall	When to Assist F1 Score	How to Assist Intention Acc.	How to Assist Sem. Sim.
GPT-4o-mini	Zero-shot	54.9%	52.7%	96.2%	68.1%	28.4%	0.280
GPT-4o-mini	CoT	55.7%	55.6%	99.5%	71.3%	30.5%	0.298
GPT-4o-mini	Self-Consistency	55.2%	52.8%	96.0%	68.2%	28.2%	0.280
Qwen3-Max	Zero-shot	59.3%	55.5%	93.4%	69.7%	36.3%	0.285
Qwen3-Max	CoT	59.8%	59.6%	72.5%	65.4%	38.2%	0.305
Qwen3-Max	Self-Consistency	59.5%	55.7%	93.5%	69.9%	36.2%	0.285
Deepseek-V3.2	Zero-shot	64.4%	60.8%	81.1%	69.5%	36.5%	0.276
Deepseek-V3.2	CoT	61.1%	60.9%	86.6%	71.3%	35.0%	0.287
Deepseek-V3.2	Self-Consistency	64.4%	60.8%	81.1%	69.6%	36.5%	0.276
Qwen3-VL-Plus	Zero-shot	53.0%	51.6%	97.0%	67.4%	37.1%	0.286
Qwen3-VL-Plus	CoT	53.5%	54.9%	61.3%	57.9%	34.4%	0.305
Qwen3-VL-Plus	Self-Consistency	53.1%	51.7%	97.0%	67.4%	36.7%	0.286
Llama-3.1-8B-Instruct	Zero-shot	57.3%	54.7%	85.7%	66.7%	32.3%	0.275
Llama-3.1-8B-Instruct	CoT	50.8%	50.4%	99.0%	66.8%	29.1%	0.294
Llama-3.1-8B-Instruct	Self-Consistency	58.8%	56.7%	85.3%	68.1%	32.5%	0.274
Qwen3-VL-8B-Instruct	Zero-shot	51.7%	50.9%	94.4%	66.1%	35.3%	0.276
Qwen3-VL-8B-Instruct	CoT	41.0%	32.7%	17.1%	22.4%	34.1%	0.277
Qwen3-VL-8B-Instruct	Self-Consistency	52.9%	51.8%	93.6%	66.7%	35.7%	0.274
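The F1 column in the table is the harmonic mean of the Precision and Recall columns. As a quick worked check, the sketch below recomputes F1 for three rows copied from the table; the function name and the choice of rows are illustrative, and small differences are expected because the table reports rounded values.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, reported F1 %) copied from the table above.
rows = {
    ("GPT-4o-mini", "Zero-shot"): (52.7, 96.2, 68.1),
    ("Qwen3-Max", "CoT"): (59.6, 72.5, 65.4),
    ("Deepseek-V3.2", "Zero-shot"): (60.8, 81.1, 69.5),
}

for (model, method), (p, r, reported) in rows.items():
    print(f"{model} {method}: computed {f1_score(p, r):.1f}, reported {reported}")
```

Many rows pair precision near 50% with recall above 90%. That pattern would be consistent with a model answering "assist" almost every time if the test set is roughly balanced, although this review does not state the class balance.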

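Longer historical context improves timing and intention prediction, but returns diminish beyond roughly 5 minutes.
Knowledge Graph-based long-term memory yields the largest gains over the zero-shot baseline (Accuracy +11.8%, Intention Accuracy +26.9%, F1 +6.1%).
Real-world data substantially outperforms synthetic data for fine-tuning across models.
Prompting strategies have mixed effects: Chain-of-Thought helps larger models but can hurt smaller or open models, and Self-Consistency offers only limited gains.
Semantic similarity scores on How to Assist are relatively low across models, indicating room to improve the quality of generated content.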
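Self-Consistency, one of the prompting strategies compared above, is usually implemented as a majority vote over several sampled answers. The sketch below shows that general idea under stated assumptions: the function name, the sample count, and the canned answers are hypothetical, and the review does not describe the exact sampling setup used in the paper.

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Draw `n_samples` answers and return the most frequent one.

    Ties go to the answer that was sampled first, because Counter
    preserves insertion order among equal counts.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

# Canned answers stand in for stochastic model samples (illustration only).
canned = iter(["assist", "wait", "assist", "assist", "wait"])
decision = self_consistency(lambda: next(canned), n_samples=5)
print(decision)
```

When a model's sampled answers rarely disagree, the vote returns the same decision as a single call, which is one plausible reading of why the Self-Consistency rows sit so close to the Zero-shot rows in the table.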

This review was generated by AI and checked by a human editor.