[Paper Review] The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation
ProPer introduces a two-agent architecture (DGA and RGA) that proactively navigates knowledge gaps; it outperforms strong baselines across medical, coding, and shopping domains in both single-turn and multi-turn evaluations.
Most language-based assistants follow a reactive ask-and-respond paradigm, requiring users to explicitly state their needs. As a result, relevant but unexpressed needs often go unmet. Existing proactive agents attempt to address this gap either by eliciting further clarification, preserving this burden, or by extrapolating future needs from context, often leading to unnecessary or mistimed interventions. We introduce ProPer, Proactivity-driven Personalized agents, a novel two-agent architecture consisting of a Dimension Generating Agent (DGA) and a Response Generating Agent (RGA). DGA, a fine-tuned LLM agent, leverages explicit user data to generate multiple implicit dimensions (latent aspects relevant to the user's task but not considered by the user) or knowledge gaps. These dimensions are selectively filtered using a reranker based on quality, diversity, and task relevance. RGA then balances explicit and implicit dimensions to tailor personalized responses with timely and proactive interventions. We evaluate ProPer across multiple domains using a structured, gap-aware rubric that measures coverage, initiative appropriateness, and intent alignment. Our results show that ProPer improves quality scores and win rates across all domains, achieving up to 84% gains in single-turn evaluation and consistent dominance in multi-turn interactions.
Research Motivation and Objectives
- Formalize proactivity as a calibration problem that balances explicit user intent with latent knowledge gaps.
- Introduce a dimension-based representation of user needs and a domain-specific benchmark (ProPerBench) for supervision.
- Propose ProPer, a modular two-agent architecture that separates knowledge-gap discovery from response generation.
- Demonstrate improved task utility and timely proactivity across medical, coding, and recommendation domains.
Proposed Method
- Dimension Generating Agent (DGA) fine-tuned to infer implicit, task-relevant dimensions from user state and to produce candidate gaps.
- Post-hoc calibrated reranker that selects a budgeted subset of candidate dimensions by optimizing a utility objective balancing quality, explicit-need alignment, and diversity.
- Response Generating Agent (RGA) that updates a baseline response by conditioning on explicit dimensions and the activated implicit dimensions.
- End-to-end ProPer pipeline: construct the interaction state, generate a baseline response r0, have the DGA propose candidate dimensions, have the reranker select the subset S_k*, and have the RGA generate an updated response that preserves the user's intent while adding targeted proactive information.
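
The selection and pipeline steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Dimension` fields, the Jaccard similarity, the weights `w_align` and `w_div`, and the greedy marginal-utility rule are all hypothetical stand-ins for the calibrated utility objective, whose exact form this review does not give. The callables passed to `run_pipeline` stand in for the baseline model, the DGA, and the RGA.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Dimension:
    """A candidate implicit dimension (hypothetical structure)."""
    text: str
    quality: float    # assumed quality score in [0, 1]
    alignment: float  # assumed alignment with explicit needs in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, used here as a simple redundancy proxy."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_dimensions(
    candidates: Sequence[Dimension],
    budget: int,
    w_align: float = 0.5,  # hypothetical weight on explicit-need alignment
    w_div: float = 1.0,    # hypothetical penalty on redundancy (diversity term)
    similarity: Callable[[str, str], float] = jaccard,
) -> List[Dimension]:
    """Greedily pick up to `budget` dimensions by marginal utility."""
    selected: List[Dimension] = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        def marginal_utility(d: Dimension) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (similarity(d.text, s.text) for s in selected), default=0.0
            )
            return d.quality + w_align * d.alignment - w_div * redundancy

        best = max(remaining, key=marginal_utility)
        selected.append(best)
        remaining.remove(best)
    return selected


def run_pipeline(state, baseline_fn, dga_fn, rga_fn, budget: int = 2):
    """Baseline response -> candidate dimensions -> selection -> updated response."""
    r0 = baseline_fn(state)            # baseline response r0
    candidates = dga_fn(state)         # DGA proposes candidate dimensions
    chosen = select_dimensions(candidates, budget)  # reranker picks a subset
    return rga_fn(state, r0, chosen)   # RGA revises r0 using the chosen subset


candidates = [
    Dimension("check drug interactions with current medication", 0.90, 0.8),
    Dimension("check drug interactions with other medication", 0.85, 0.8),
    Dimension("schedule follow-up visit", 0.60, 0.5),
]
print([d.text for d in select_dimensions(candidates, budget=2)])
```

With the default redundancy penalty, the near-duplicate second candidate loses to the lower-scored but distinct third one, which is the behavior a diversity term is meant to produce. Setting `w_div=0.0` reduces the selection to a plain top-k by score.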

Experimental Results
Research Questions
- RQ1: Does ProPer improve end-to-end task utility across domains compared with strong baselines?
- RQ2: How do the DGA, reranking, and RGA components individually contribute to performance?
- RQ3: Are observed gains due to calibrated proactivity rather than mere verbosity?
- RQ4: Does ProPer maintain robustness in multi-turn conversations?
Key Findings
| Model | Medical μScore | Medical Win% | Code-Contests μScore | Code-Contests Win% | PWAB μScore | PWAB Win% |
|---|---|---|---|---|---|---|
| LLaMA-8B | 2.19 | 10.52 | 1.26 | 15.51 | 2.34 | 6.83 |
| LLaMA-8B + ProPer | 3.86 † | 89.48 † | 2.13 † | 84.49 † | 4.06 | 93.17 † |
| Qwen-8B | 2.93 | 18.73 | 2.24 | 24.76 | 3.12 | 12.50 |
| Qwen-8B + ProPer | 4.03 † | 81.27 † | 2.84 † | 75.24 † | 4.29 † | 87.50 † |
| GPT-4 | 3.28 | 29.74 | 3.19 † | 68.93 † | 3.46 | 23.61 |
| GPT-4 + ProPer (vs LLaMA-8B + ProPer) | 3.73 | 70.26 | 2.08 | 31.07 | 4.11 † | 76.39 † |
| GPT-4 | 3.26 | 19.26 | 3.11 | 43.63 | 3.53 | 17.40 |
| GPT-4 + ProPer (vs Qwen-8B + ProPer) | 4.03 † | 80.74 † | 2.71 | 56.37 | 4.24 † | 82.60 † |
- ProPer consistently improves task utility over strong base LLMs and chain-of-thought prompting across the medical, coding, and PWAB domains.
- End-to-end gains reach up to 84% in single-turn evaluations, and ProPer is consistently preferred in multi-turn interactions.
- Ablations show removing the DGA causes substantial performance drops, while removing the reranker causes smaller degradations, highlighting the importance of implicit-dimension generation.
- DGA-derived dimensions outperform those generated directly by base LLMs, indicating the value of learned latent gaps.
- Calibration parameters controlling activation and diversity (lambda1, lambda2) affect domain sensitivity, with medical and PWAB benefiting from higher activation.
- Multi-turn evaluations show ProPer preferred in 11/12 Medical, 9/12 Code-Contests, and 12/12 PWAB conversations, illustrating stability of calibrated proactivity.
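
As a quick arithmetic check on the figures above, the sketch below computes the relative μScore gain for the two 8B base models, confirms that each pairwise win-rate pair sums to 100%, and converts the multi-turn counts into preference rates. It uses only numbers reported in this review; the dictionary layout and the function names `relative_gain` and `preference_rate` are illustrative choices, and reading Win% as a head-to-head split is an assumption based on the pairs summing to 100.

```python
# (base μScore, μScore with ProPer) per domain, copied from the table above.
MU_SCORES = {
    "LLaMA-8B": {"Medical": (2.19, 3.86), "Code-Contests": (1.26, 2.13), "PWAB": (2.34, 4.06)},
    "Qwen-8B": {"Medical": (2.93, 4.03), "Code-Contests": (2.24, 2.84), "PWAB": (3.12, 4.29)},
}

# (base Win%, Win% with ProPer) for LLaMA-8B, copied from the table above.
LLAMA_WIN_RATES = {
    "Medical": (10.52, 89.48),
    "Code-Contests": (15.51, 84.49),
    "PWAB": (6.83, 93.17),
}

# (conversations where ProPer was preferred, total conversations).
MULTI_TURN = {"Medical": (11, 12), "Code-Contests": (9, 12), "PWAB": (12, 12)}


def relative_gain(base: float, improved: float) -> float:
    """Relative improvement over the base score, in percent."""
    return (improved - base) / base * 100.0


def preference_rate(wins: int, total: int) -> float:
    """Share of conversations in which ProPer was preferred, in percent."""
    return wins / total * 100.0


for model, domains in MU_SCORES.items():
    for domain, (base, improved) in domains.items():
        print(f"{model:9s} {domain:14s} gain: {relative_gain(base, improved):5.1f}%")

for domain, (lose, win) in LLAMA_WIN_RATES.items():
    # Pairwise win rates should split 100% between the two systems.
    print(f"{domain:14s} win-rate sum: {lose + win:.2f}")

for domain, (wins, total) in MULTI_TURN.items():
    print(f"{domain:14s} multi-turn preference: {preference_rate(wins, total):5.1f}%")
```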

This review was generated by AI and checked by a human editor.