[Paper Interpretation] Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning
TL;DR: Lexpop trains a neural policy with deep reinforcement learning (DRL) and extracts finite-state controllers (FSCs) that can be formally verified. It extends to robust FSCs for hidden-model POMDPs (HM-POMDPs) by iteratively training against worst-case models.
Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.
Research Motivation and Objectives
- Enable scalable solving of large POMDPs by combining DRL with finite-state controller extraction.
- Provide formal verification of extracted FSCs to guarantee performance.
- Extend the framework to robust FSCs for hidden-model POMDPs to handle model uncertainty.
- Compare DRL-based FSCs against state-of-the-art model-based solvers in single and HM-POMDP settings.
Proposed Method
- Train an RNN-based neural policy with DRL (PPO) using a vectorized simulator.
- Extract a stochastic FSC that mimics the neural policy using Alergia or a self-interpretable network (SIG).
- Verify the extracted FSC analytically by constructing a Markov chain and computing its value.
- Extend Lexpop to HM-POMDPs by iteratively training against worst-case POMDPs and extracting robust FSCs.
- Use Paynt to efficiently search among induced Markov chains for worst-case models.
- Allow policy extraction to treat policies as black boxes, independent of the underlying policy architecture.
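
The verification step in the list above can be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes a discounted-reward objective, deterministic observations, and an FSC whose memory update depends only on the current node and observation. The function names (`induced_markov_chain`, `evaluate`) and the two-state toy model are hypothetical, and plain fixed-point iteration stands in for a model checker.

```python
from itertools import product


def induced_markov_chain(states, nodes, obs, trans, reward, fsc_action, fsc_update):
    """Build the Markov chain over (state, node) pairs induced by an FSC on a POMDP."""
    chain, chain_reward = {}, {}
    for s, n in product(states, nodes):
        o = obs[s]  # assumption: each state emits exactly one observation
        succ, r = {}, 0.0
        for a, pa in fsc_action[n][o].items():
            r += pa * reward[s][a]
            for s2, ps in trans[s][a].items():
                for n2, pn in fsc_update[n][o].items():
                    key = (s2, n2)
                    succ[key] = succ.get(key, 0.0) + pa * ps * pn
        chain[(s, n)] = succ
        chain_reward[(s, n)] = r
    return chain, chain_reward


def evaluate(chain, chain_reward, gamma=0.9, tol=1e-10):
    """Discounted value of every chain state, computed by fixed-point iteration."""
    value = {x: 0.0 for x in chain}
    while True:
        new = {
            x: chain_reward[x] + gamma * sum(p * value[y] for y, p in chain[x].items())
            for x in chain
        }
        if max(abs(new[x] - value[x]) for x in chain) < tol:
            return new
        value = new


# Hypothetical toy POMDP: the agent cannot tell "left" from "right".
states, nodes = ["left", "right"], ["n0", "n1"]
obs = {"left": "dark", "right": "dark"}
trans = {
    "left": {"stay": {"left": 1.0}, "switch": {"right": 1.0}},
    "right": {"stay": {"right": 1.0}, "switch": {"left": 1.0}},
}
reward = {
    "left": {"stay": 0.0, "switch": 0.0},
    "right": {"stay": 1.0, "switch": 1.0},
}
# Two-node FSC: switch once (n0), then stay forever (n1).
fsc_action = {"n0": {"dark": {"switch": 1.0}}, "n1": {"dark": {"stay": 1.0}}}
fsc_update = {"n0": {"dark": {"n1": 1.0}}, "n1": {"dark": {"n1": 1.0}}}

chain, chain_reward = induced_markov_chain(
    states, nodes, obs, trans, reward, fsc_action, fsc_update
)
values = evaluate(chain, chain_reward)
```

Starting in `("left", "n0")`, the controller earns nothing in the first step and a reward of 1 in every later step, so with a discount of 0.9 the computed value is approximately 9.0. Because the chain is finite and explicit, this value is exact up to the iteration tolerance, which is what distinguishes an FSC from a neural policy that can only be estimated by sampling.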

Experimental Results
Research Questions
- RQ1: Can Lexpop construct FSCs with higher value than state-of-the-art FSC synthesis for POMDPs?
- RQ2: Can the FSC extraction preserve or improve the neural policy value across tested problems?
- RQ3: Does the self-interpretable SIG extraction improve fidelity over automata learning (Alergia)?
- RQ4: In HM-POMDPs, can Lexpop yield robust FSCs with higher worst-case value than rfPG?
- RQ5: Is worst-case POMDP selection critical for solving HM-POMDPs effectively?
Main Findings
- Lexpop demonstrates scalability, outperforming state-of-the-art solvers on large POMDPs in several benchmarks.
- FSCs extracted from final neural policies via Alergia or SIG achieve competitive or superior values compared to the neural policies in many cases.
- SIG-based extraction provides competitive fidelity with smaller FSCs and maintains robust performance in HM-POMDP settings.
- In HM-POMDP experiments, Lexpop variants achieve higher robust values than rfPG on multiple models, with competitive FSC sizes.
- Iterative worst-case POMDP selection during RobustLexpop improves robustness across the model family.
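
The iterative worst-case selection used for HM-POMDPs can be sketched as a simple loop. This is an illustration under stated assumptions, not the paper's algorithm: `robust_loop`, `train`, `extract`, and `evaluate` are hypothetical names, the family is enumerated explicitly (the source uses Paynt to search for worst-case models instead), and keeping the best controller seen so far is a design choice made here for the sketch.

```python
def robust_loop(family, train, extract, evaluate, iterations=5):
    """Alternate between training on collected worst-case models and re-extracting an FSC."""
    training_set = [family[0]]  # assumption: start from an arbitrary member of the family
    best_fsc, best_value = None, float("-inf")
    policy = None
    for _ in range(iterations):
        policy = train(policy, training_set)  # stands in for DRL training
        fsc = extract(policy)                 # stands in for Alergia or SIG extraction
        # Worst-case model for this controller (explicit enumeration of the family).
        worst_model = min(family, key=lambda m: evaluate(fsc, m))
        worst_value = evaluate(fsc, worst_model)
        if worst_value > best_value:
            best_fsc, best_value = fsc, worst_value
        if worst_model not in training_set:
            training_set.append(worst_model)
    return best_fsc, best_value


# Toy stand-ins: a "model" is a target number and a "controller" is a guess at it.
family = [0.0, 4.0, 10.0]
toy_train = lambda policy, models: sum(models) / len(models)
toy_extract = lambda policy: policy
toy_evaluate = lambda fsc, model: -abs(fsc - model)

robust_fsc, robust_value = robust_loop(family, toy_train, toy_extract, toy_evaluate)
```

In the toy run, the first controller is tuned only to the model `0.0` and has a worst-case value of -10 on the model `10.0`. Once that model joins the training set, the controller moves to 5.0 and the worst-case value improves to -5, which mirrors the idea that training against worst-case models raises the robust value across the family.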

This interpretation was generated by AI and reviewed by human editors.