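[Paper Summary] Online Bayesian Goal Inference for Boundedly-Rational Planning Agents
The paper proposes SIPS, a sequential Monte Carlo method for inferring an agent's goals online from both optimal and suboptimal plans, by modeling the agent as a boundedly-rational planner that interleaves search with execution.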
People routinely infer the goals of others by observing their actions over time. Remarkably, we can do so even when those actions lead to failure, enabling us to assist others when we detect that they might not achieve their goals. How might we endow machines with similar capabilities? Here we present an architecture capable of inferring an agent's goals online from both optimal and non-optimal sequences of actions. Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning, thereby accounting for sub-optimal behavior. These models are specified as probabilistic programs, allowing us to represent and perform efficient Bayesian inference over an agent's goals and internal planning processes. To perform such inference, we develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models, limiting computation by incrementally extending inferred plans as new actions are observed. We present experiments showing that this modeling and inference architecture outperforms Bayesian inverse reinforcement learning baselines, accurately inferring goals from both optimal and non-optimal trajectories involving failure and back-tracking, while generalizing across domains with compositional structure and sparse rewards.
Research Motivation and Goals
- Motivates the need to infer goals from sub-optimal or failed plans as humans do.
- Proposes a generative model of boundedly rational planning agents interacting with a symbolic environment.
- Develops Sequential Inverse Plan Search (SIPS), an online SMC algorithm leveraging replanning to limit computation.
- Embeds goals, states, and observations in a PDDL-based framework to support diverse domains.
- Evaluates the approach against Bayesian IRL baselines across multiple domains and human-subject benchmarks.
Proposed Method
- Model agents as probabilistic programs with a goal prior, plan updates, action selection, and state transitions.
- Represent goals and states using PDDL to handle diverse domains and sparse rewards.
- Model sub-optimal planning via a stochastic boundedly-rational search with a random planning budget sampled from a negative binomial distribution.
- Perform online inference with Sequential Inverse Plan Search (SIPS), a particle-based method that extends hypothesized plans as observations arrive.
- Use resampling and two rejuvenation kernels (heuristic-driven goal proposals and error-driven replanning proposals) to maintain hypothesis diversity.
- Implement inference in Gen with planning-domain embedding and leverage online partial-plan extension to keep computation tractable.
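
The steps above can be illustrated with a toy particle filter. This is a minimal sketch under stated assumptions, not the paper's Gen implementation: the 1-D corridor, the two goals in `GOALS`, the noise level `EPS`, the negative-binomial parameters, and every function name are hypothetical. Each particle holds a goal hypothesis and a partial plan that is extended only when it runs out, using a planning budget drawn from a negative binomial distribution. The two rejuvenation kernels are omitted, so this sketch can lose goal hypotheses after resampling in a way the full method is designed to avoid.

```python
import random

GOALS = {"left": 0, "right": 10}   # hypothetical goal positions on a 1-D corridor
ACTIONS = (-1, +1)                 # step left or step right
EPS = 0.1                          # assumed action-noise level


def sample_budget(rng, r=2, p=0.5):
    """Negative-binomial draw (failures before the r-th success), plus 1 so the budget is positive."""
    failures, successes = 0, 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures + 1


def extend_plan(state, goal_pos, budget):
    """Budget-limited search: plan at most `budget` steps toward the goal."""
    steps = min(budget, abs(goal_pos - state))
    direction = 1 if goal_pos > state else -1
    return [direction] * steps


def action_likelihood(action, plan):
    """Follow the plan with probability 1 - EPS, otherwise act uniformly at random."""
    if not plan:  # hypothesized agent is already at its goal: no preference
        return 1.0 / len(ACTIONS)
    match = 1.0 if action == plan[0] else 0.0
    return (1 - EPS) * match + EPS / len(ACTIONS)


def infer_goals(start, observed_actions, n_particles=200, seed=0):
    """Return an approximate posterior over goals given the observed actions."""
    rng = random.Random(seed)
    names = list(GOALS)
    # Each particle is [goal name, partial plan, weight]; goals come from a uniform prior.
    particles = [[rng.choice(names), [], 1.0] for _ in range(n_particles)]
    state = start
    for action in observed_actions:
        for p in particles:
            goal, plan, _ = p
            if not plan:  # extend the hypothesized plan only when it has run out
                plan = extend_plan(state, GOALS[goal], sample_budget(rng))
            p[2] *= action_likelihood(action, plan)
            # Consume the step if the agent followed it; otherwise force a replan.
            p[1] = plan[1:] if plan and plan[0] == action else []
        state += action
        total = sum(p[2] for p in particles)
        ess = total ** 2 / sum(p[2] ** 2 for p in particles)
        if ess < n_particles / 2:  # resample when the effective sample size is low
            weights = [p[2] for p in particles]
            chosen = rng.choices(particles, weights=weights, k=n_particles)
            particles = [[g, list(pl), total / n_particles] for g, pl, _ in chosen]
    total = sum(p[2] for p in particles)
    return {g: sum(p[2] for p in particles if p[0] == g) / total for g in names}


posterior = infer_goals(start=4, observed_actions=[+1, +1, +1, +1])
print(posterior)
```

Starting at position 4 and observing four steps to the right, the sketch should place most of the posterior mass on `right`. In this 1-D world a small budget never produces a suboptimal plan; it only shortens the partial plan, which is enough to show the incremental plan-extension idea.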
Experimental Results
Research Questions
- RQ1: Can online Bayesian inference recover an agent's goal from sub-optimal or failed sequences of actions?
- RQ2: How does modeling boundedly rational planning (replanning with limited search) affect the ability to infer goals online?
- RQ3: Does SIPS outperform Bayesian IRL baselines in accuracy and speed across diverse planning domains?
- RQ4: How robust is the approach to model mismatch and to human-like planning behavior?
- RQ5: Can the framework generalize to domains with compositional structure and sparse rewards?
Key Findings
| Domain | Method | P(g_true\|o) Q1 | P(g_true\|o) Q2 | P(g_true\|o) Q3 | Top-1 Q1 | Top-1 Q2 | Top-1 Q3 | C0 (s) | MC (s) | AC (s) | N |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Taxi (3 Goals) | SIPS (ours) | 0.44 | 0.50 | 0.62 | 0.53 | 0.56 | 0.67 | 13.0 | 1.80 | 2.55 | 1429 |
| Taxi (3 Goals) | BIRL (unbiased) | 0.34 | 0.35 | 0.79 | 0.33 | 0.42 | 0.92 | 2.23 | 0.00 | 0.16 | 10000 |
| Taxi (3 Goals) | BIRL (oracle) | 0.37 | 0.47 | 0.81 | 0.42 | 0.44 | 0.86 | 1.63 | 0.00 | 0.12 | 2500 |
| Doors, Keys & Gems (3 Goals) | SIPS (ours) | 0.37 | 0.51 | 0.61 | 0.74 | 0.74 | 0.74 | 3.30 | 0.70 | 0.86 | 2099 |
| Doors, Keys & Gems (3 Goals) | BIRL (unbiased) | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 | 3326 | 0.12 | 154 | 250000 |
| Doors, Keys & Gems (3 Goals) | BIRL (oracle) | 0.37 | 0.36 | 0.42 | 0.44 | 0.60 | 0.80 | 150 | 0.12 | 7.01 | 10000 |
| Block Words (5 Goals) | SIPS (ours) | 0.47 | 0.83 | 0.90 | 0.78 | 0.84 | 0.91 | 20.8 | 2.46 | 4.15 | 2506 |
| Block Words (5 Goals) | BIRL (unbiased) | 0.20 | 0.20 | 0.21 | 0.42 | 0.49 | 0.56 | 687 | 0.27 | 63.6 | 250000 |
| Block Words (5 Goals) | BIRL (oracle) | 0.20 | 0.29 | 0.45 | 0.73 | 0.80 | 0.96 | 22.2 | 0.05 | 2.12 | 10000 |
| Intrusion Detection (20 Goals) | SIPS (ours) | 0.56 | 0.87 | 0.87 | 0.65 | 0.87 | 0.87 | 375 | 6.60 | 28.0 | 13321 |
| Intrusion Detection (20 Goals) | BIRL (unbiased) | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 18038 | 0.75 | 1069 | 250000 |
| Intrusion Detection (20 Goals) | BIRL (oracle) | 0.09 | 0.24 | 0.53 | 0.94 | 1.00 | 1.00 | 98 | 0.02 | 6.00 | 10000 |
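
As a reading aid for the accuracy columns, the sketch below shows one way such numbers can be computed. It assumes that P(g_true|o) is the posterior probability assigned to the true goal, averaged over test trajectories, and that Top-1 is the fraction of trajectories in which the true goal is the most probable hypothesis. The function name and the example posteriors are hypothetical and are not data from the paper.

```python
def true_goal_metrics(posteriors, true_goals):
    """Compute (mean posterior of the true goal, Top-1 accuracy).

    posteriors: list of dicts mapping goal name -> posterior probability
    true_goals: list of the true goal name for each trajectory
    """
    n = len(true_goals)
    pairs = list(zip(posteriors, true_goals))
    # Average probability mass placed on the true goal.
    mean_p_true = sum(post[goal] for post, goal in pairs) / n
    # Fraction of trajectories whose most probable goal is the true one
    # (ties go to the goal that appears first in the dict).
    top1 = sum(max(post, key=post.get) == goal for post, goal in pairs) / n
    return mean_p_true, top1


# Hypothetical posteriors for two trajectories with three candidate goals.
example_posteriors = [
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.2, "b": 0.5, "c": 0.3},
]
example_true_goals = ["a", "c"]
print(true_goal_metrics(example_posteriors, example_true_goals))
```

The two metrics can diverge: a method can rank the true goal first while giving it only modest probability, which is why the table reports both.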
- SIPS accurately infers goals from both optimal and non-optimal trajectories, including backtracking and failures.
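- Against unbiased BIRL, SIPS is more accurate and has lower C0 and AC in three of the four domains (Doors, Keys & Gems, Block Words, and Intrusion Detection); in Doors, Keys & Gems, for example, its AC is 0.86 s versus 154 s. Taxi is the exception: unbiased BIRL is cheaper there (AC 0.16 s versus 2.55 s) and reaches a higher true-goal posterior at Q3 (0.79 versus 0.62). Against oracle BIRL, SIPS is sometimes comparable or better, and in Doors, Keys & Gems it achieves this at lower cost (AC 0.86 s versus 7.01 s).
- In those same three domains, SIPS assigns the true goal a posterior probability P(g_true|o) at least as high as both BIRL baselines at Q1, Q2, and Q3 (for example, 0.90 versus 0.45 for oracle BIRL at Q3 in Block Words).
- Human goal inferences over time correlate more strongly with SIPS than with the BIRL baseline, suggesting that SIPS captures aspects of human-like reasoning.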
- SIPS demonstrates robustness to moderate mismatches between the data-generating process and the assumed agent model, and remains effective on human data.
This summary was generated by AI and reviewed by a human editor.