QUICK REVIEW

[Paper Digest] Information-Driven Active Perception for k-step Predictive Safety Monitoring

Sumukha Udupa, Jie Fu|arXiv (Cornell University)|Mar 24, 2026

Reinforcement Learning in Robotics|Cited by 0

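One-Sentence Summary

The paper proposes an information-theoretic active perception policy that, under a sensor query budget, minimizes uncertainty about the k-step predictive safety of a partially observable system. The system is modeled as a labeled HMM with a DFA-based safety specification, and the policy is optimized by policy gradient using observable operators.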

ABSTRACT

This work studies the synthesis of active perception policies for predictive safety monitoring in partially observable stochastic systems. Operating under strict sensing and communication budgets, the proposed monitor dynamically schedules sensor queries to maximize information gain about the safety of future states. The underlying stochastic dynamics are captured by a labeled hidden Markov model (HMM), with safety requirements defined by a deterministic finite automaton (DFA). To enable active information acquisition, we introduce minimizing k-step Shannon conditional entropy of the safety of future states as a planning objective, under the constraint of a limited sensor query budget. Using observable operators, we derive an efficient algorithm to compute the k-step conditional entropy and analyze key properties of the conditional entropy gradient with respect to policy parameters. We validate the effectiveness of the method for predictive safety monitoring through a dynamic congestion game example.

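Motivation and Objectives

Motivate predictive safety monitoring under partial observability and resource constraints.
Formulate active perception as minimizing the k-step predictive entropy of the safety outcome.
Develop a policy gradient method based on observable operators to produce budget-aware sensing policies.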
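
The entropy-minimization objective above can be written out as a standard conditional entropy. The notation here is assumed for illustration and may differ from the paper's: $Z_k$ is the binary event that a safety violation occurs within $k$ steps, $Y$ is the observation history, and $\theta$ denotes the sensing-policy parameters.

```latex
H(Z_k \mid Y;\theta)
  = -\sum_{y} P_\theta(y) \sum_{z \in \{0,1\}} P_\theta(z \mid y)\,\log P_\theta(z \mid y),
\qquad
\theta^\star \in \arg\min_{\theta} H(Z_k \mid Y;\theta)
```

The minimization is carried out subject to the limited sensor query budget described in the abstract.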

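Proposed Method

Model the environment as a labeled HMM with controllable emissions, with safety specified by a DFA.
Construct a product HMM that couples the system dynamics with the safety labeling.
Quantify k-step predictive safety uncertainty as the entropy of the event that a failure state is entered within k steps, conditioned on the observation history.
Derive the gradient of the conditional entropy with respect to the policy parameters using observable operators.
Propose a policy gradient learning rule with switching-cost regularization to balance information gain against sensing cost.
Provide a sample-based approximation of the gradient over observation histories.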
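
To make the quantity being minimized concrete, here is a minimal Python sketch of the k-step failure probability and its binary entropy for a single belief state. It uses a plain forward filter on a tiny, hypothetical 3-state product chain rather than the paper's observable-operator algorithm, and every name and number in it (`P`, `E`, `k_step_failure_prob`, `bayes_update`, the state layout) is an assumption made for illustration.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def step(belief, P):
    """Push a belief one step through the row-stochastic matrix P."""
    n = len(P)
    return [sum(belief[s] * P[s][t] for s in range(n)) for t in range(n)]

def k_step_failure_prob(belief, P, fail_states, k):
    """Probability mass on the absorbing failure states after k steps."""
    b = list(belief)
    for _ in range(k):
        b = step(b, P)
    return sum(b[s] for s in fail_states)

def bayes_update(belief, P, E, obs):
    """Predict one step, then condition on observation index obs.
    E[s][o] is the emission probability under the chosen sensor."""
    pred = step(belief, P)
    unnorm = [pred[s] * E[s][obs] for s in range(len(pred))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical product chain: 0 = safe, 1 = risky, 2 = failure (absorbing).
P = [[0.9, 0.1, 0.0],
     [0.2, 0.5, 0.3],
     [0.0, 0.0, 1.0]]
# Hypothetical emission model for one sensor with two observation symbols.
E = [[0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.5]]

prior = [0.5, 0.5, 0.0]
p_before = k_step_failure_prob(prior, P, [2], k=2)
posterior = bayes_update(prior, P, E, obs=0)
p_after = k_step_failure_prob(posterior, P, [2], k=2)

print(binary_entropy(p_before), binary_entropy(p_after))
```

In this toy case, observing symbol 0 shifts belief toward the safe state and lowers the entropy of the 2-step failure event. The paper's objective additionally averages this entropy over observation histories induced by the sensing policy, which the sketch does not do.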

Figure 1: Environment topological graph with sensor coverages.

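Experimental Results

Research Questions

RQ1: Under a sensing budget, how can sensor queries be actively scheduled to minimize uncertainty about whether a safety violation occurs within the next k steps?
RQ2: In a partially observable environment, can an information-theoretic objective (the k-step conditional entropy) be optimized efficiently using observable operators?
RQ3: How does the learned active perception policy perform in predictive safety monitoring compared with random sensing and with an oracle that has perfect information?
RQ4: How does the sensor switching cost affect the learned sensing policy and its predictive performance?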

Key Findings

Horizon (k)	Uniform Random Brier Score	Uniform Random Cost	Trained Policy Brier Score	Trained Policy Cost	Oracle Brier Score	Oracle Cost	% Imprv.
1	0.1791 ± 0.0075	227.07 ± 4.00	0.0564 ± 0.0029	200.65 ± 2.24	0.0149 ± 0.0006	68.53 ± 0.00	74.72%
3	0.1880 ± 0.0066	224.07 ± 4.07	0.0799 ± 0.0035	198.11 ± 2.00	0.0386 ± 0.0016	195.84 ± 1.87	72.33%
5	0.1931 ± 0.0058	230.31 ± 3.80	0.0939 ± 0.0040	195.84 ± 1.87	0.0576 ± 0.0025	194.85 ± 1.91	73.21%
10	0.2007 ± 0.0055	224.57 ± 4.14	0.1255 ± 0.0049	194.55 ± 1.78	0.0921 ± 0.0040	194.55 ± 1.78	69.24%
15	0.2012 ± 0.0069	230.18 ± 3.81	0.1395 ± 0.0059	194.85 ± 1.91	0.1124 ± 0.0051	194.85 ± 1.91	69.48%
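
The % Imprv. column is not defined in this digest. Its values are numerically consistent with the share of the random-to-oracle Brier gap that the trained policy closes; that reading is inferred from the table, not taken from the paper. A minimal Python check of this assumed formula, using the mean Brier scores above (the helper name `gap_closed` is hypothetical):

```python
# k -> (uniform random, trained policy, oracle, reported % Imprv.)
rows = {
    1:  (0.1791, 0.0564, 0.0149, 74.72),
    3:  (0.1880, 0.0799, 0.0386, 72.33),
    5:  (0.1931, 0.0939, 0.0576, 73.21),
    10: (0.2007, 0.1255, 0.0921, 69.24),
    15: (0.2012, 0.1395, 0.1124, 69.48),
}

def gap_closed(random_bs, trained_bs, oracle_bs):
    """Percent of the random-to-oracle Brier gap closed by the trained policy
    (assumed definition of the % Imprv. column)."""
    return 100.0 * (random_bs - trained_bs) / (random_bs - oracle_bs)

computed = {k: gap_closed(r, t, o) for k, (r, t, o, _) in rows.items()}
for k, value in computed.items():
    print(k, round(value, 2), rows[k][3])
```

The recomputed values agree with the reported column to within a few hundredths of a percentage point, which is what rounding of the published means would produce.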

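Across all tested horizons k, the proposed method lowers the safety-prediction error, measured by Brier score, relative to the uniform random sensing baseline.
The learned policy approaches the performance of an oracle with perfect state information, with a marked drop in Brier score (for example, the mean falls from 0.1791 to 0.0564 at k=1).
There is a clear trade-off between information gain and sensing cost: increasing the cost parameter alpha reduces sensor usage while slightly increasing predictive uncertainty.
Policy training converges under the gradient-based learning rule, and the gap to the oracle narrows substantially at every horizon in {1, 3, 5, 10, 15}.
Brier scores rise as the horizon k grows, indicating greater uncertainty, but the learned policy still improves markedly on random sensing, lowering the Brier score by roughly 30.66% to 68.53% relative to the random baseline.
The method is validated on a dynamic congestion game, indicating that information-driven active perception is effective in practice for predictive safety monitoring.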
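
For readers unfamiliar with the metric used above: the Brier score is the mean squared difference between a predicted probability and the binary outcome, so lower is better and a constant 0.5 forecast scores 0.25. A minimal Python sketch follows; the forecasts and outcomes in it are made up for illustration and are not data from the paper.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Illustrative data: 1 = a safety violation occurred within k steps.
outcomes = [1, 0, 1, 0]
informed = [0.9, 0.2, 0.6, 0.1]   # forecasts from a well-informed monitor
uninformed = [0.5, 0.5, 0.5, 0.5]  # forecasts that carry no information

informed_score = brier_score(informed, outcomes)
uninformed_score = brier_score(uninformed, outcomes)
print(informed_score, uninformed_score)
```

On this scale, the uniform random baseline in the table (about 0.18 to 0.20) sits not far below the uninformative 0.25 level, while the trained policy and oracle score well under it.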

Figure 4: Comparison of $k$-step prediction accuracy with posterior sampling.

This digest was generated by AI and reviewed by human editors.