QUICK REVIEW

[Paper Review] Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning

David Hudák, Maris F. L. Galesloot|arXiv (Cornell University)|2026. 02. 09.

Reinforcement Learning in Robotics | Citations: 0

One-Line Summary

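Lexpop trains a neural policy with deep reinforcement learning (DRL), extracts from it a finite-state controller (FSC) that can be formally verified, and extends to robust FSCs for HM-POMDPs by iteratively training on worst-case models.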

ABSTRACT

Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.

Motivation and Objectives

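Enable scalable solving of large POMDPs by combining DRL with finite-state controller extraction.
Provide performance guarantees through formal verification of the extracted FSCs.
Extend the framework to robust FSCs for hidden-model POMDPs in order to handle model uncertainty.
Compare DRL-based FSCs against state-of-the-art model-based solvers in both single-POMDP and HM-POMDP settings.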

Proposed Method

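Train an RNN-based neural policy with DRL (PPO) using a vectorized simulator.
Extract a stochastic FSC that mimics the neural policy, using either Alergia or a self-interpretable network (SIG).
Verify the extracted FSC analytically by constructing the induced Markov chain and computing its value.
Extend Lexpop to HM-POMDPs by iteratively training on worst-case POMDPs and extracting a robust FSC.
Use Paynt to search efficiently for the worst-case model among the induced Markov chains.
Treat the policy as a black box during extraction, so that extraction is independent of the underlying policy architecture.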
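
The analytic verification step can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a discounted-reward objective, deterministic observations, and plain dictionary encodings, and the names `induced_chain`, `evaluate`, `fsc_act`, and `fsc_upd` are hypothetical. It builds the Markov chain over (state, node) pairs induced by a stochastic FSC and computes its value by iterative policy evaluation.

```python
from itertools import product


def induced_chain(trans, obs, reward, fsc_act, fsc_upd):
    """Build the Markov chain over (state, node) pairs induced by an FSC.

    trans[s][a]   -> {next_state: prob}
    obs[s]        -> observation emitted in s (assumed deterministic)
    reward[s][a]  -> immediate reward
    fsc_act[n][o] -> {action: prob}
    fsc_upd[n][o] -> {next_node: prob}
    """
    chain, exp_reward = {}, {}
    for s, n in product(trans, fsc_act):
        o = obs[s]
        succ, r = {}, 0.0
        for a, p_a in fsc_act[n][o].items():
            r += p_a * reward[s][a]
            for s2, p_s in trans[s][a].items():
                for n2, p_n in fsc_upd[n][o].items():
                    key = (s2, n2)
                    succ[key] = succ.get(key, 0.0) + p_a * p_s * p_n
        chain[(s, n)] = succ
        exp_reward[(s, n)] = r
    return chain, exp_reward


def evaluate(chain, exp_reward, gamma=0.95, tol=1e-10):
    """Discounted value of every chain state via iterative policy evaluation."""
    values = {x: 0.0 for x in chain}
    while True:
        new = {
            x: exp_reward[x] + gamma * sum(p * values[y] for y, p in chain[x].items())
            for x in chain
        }
        if max(abs(new[x] - values[x]) for x in chain) < tol:
            return new
        values = new


# Toy POMDP (illustrative only): both states emit the same observation "z".
# Action "a" pays off in s0, action "b" pays off in s1.
trans = {
    "s0": {"a": {"s1": 1.0}, "b": {"s0": 1.0}},
    "s1": {"a": {"s1": 1.0}, "b": {"s0": 1.0}},
}
obs = {"s0": "z", "s1": "z"}
reward = {"s0": {"a": 1.0, "b": 0.0}, "s1": {"a": 0.0, "b": 1.0}}

# Two-node FSC that alternates between "a" and "b".
fsc_act = {"n0": {"z": {"a": 1.0}}, "n1": {"z": {"b": 1.0}}}
fsc_upd = {"n0": {"z": {"n1": 1.0}}, "n1": {"z": {"n0": 1.0}}}

chain, exp_reward = induced_chain(trans, obs, reward, fsc_act, fsc_upd)
values = evaluate(chain, exp_reward)
print(round(values[("s0", "n0")], 6))
```

Started in (s0, n0), the alternating controller collects reward 1 at every step, so with discount 0.95 its value is 1 / (1 - 0.95) = 20. A dedicated model checker would typically solve the induced chain exactly instead of iterating; the loop here only keeps the sketch dependency-free.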

Figure 1. The overall idea of Lexpop, our RL-based FSC extraction. The robust extensions are depicted in green.

Experimental Results

Research Questions

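RQ1: Can Lexpop construct FSCs with higher values than state-of-the-art FSC synthesis for POMDPs?
RQ2: Does FSC extraction preserve or improve the value of the neural policy on the tested problems?
RQ3: Does self-interpretable SIG extraction improve fidelity over automata learning (Alergia)?
RQ4: Can Lexpop obtain robust FSCs with higher worst-case values than rfPG on HM-POMDPs?
RQ5: Is worst-case POMDP selection a decisive factor in solving HM-POMDPs effectively?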

Key Results

Lexpop scales well and outperforms state-of-the-art solvers on large POMDPs across several benchmarks.
FSCs extracted from final neural policies via Alergia or SIG achieve competitive or superior values compared to the neural policies in many cases.
SIG-based extraction provides competitive fidelity with smaller FSCs and maintains robust performance in HM-POMDP settings.
In HM-POMDP experiments, Lexpop variants achieve higher robust values than rfPG on multiple models, with competitive FSC sizes.
Iterative worst-case POMDP selection during RobustLexpop improves robustness across the model family.
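
The iterative worst-case selection can be sketched as a generic loop. This is a hedged illustration under stated assumptions, not the RobustLexpop procedure itself: the callables `train`, `extract`, and `evaluate` are hypothetical placeholders for DRL training, FSC extraction, and FSC evaluation, and the brute-force minimum over the whole family stands in for a more efficient worst-case search such as the one the review attributes to Paynt.

```python
def robust_loop(models, train, extract, evaluate, iterations=5):
    """Worst-case-driven training over a finite family of models (sketch)."""
    training_set = [0]  # model indices; assumption: start from the first member
    policy, best_fsc, best_value = None, None, float("-inf")
    for _ in range(iterations):
        # Train on the worst-case models collected so far, then extract an FSC.
        policy = train(policy, [models[i] for i in training_set])
        fsc = extract(policy)
        # Evaluate the FSC on every model and locate its worst case.
        values = [evaluate(fsc, m) for m in models]
        worst = min(range(len(models)), key=values.__getitem__)
        # Keep the controller with the highest worst-case value seen so far.
        if values[worst] > best_value:
            best_fsc, best_value = fsc, values[worst]
        if worst not in training_set:
            training_set.append(worst)
    return best_fsc, best_value


# Toy stand-ins (purely illustrative): each "model" is a target number,
# a "policy" is a single float, and the value is the negative distance.
def train(policy, targets):
    return sum(targets) / len(targets)


def extract(policy):
    return policy


def evaluate(fsc, model):
    return -abs(fsc - model)


models = [0.0, 4.0, 10.0]
best_fsc, best_value = robust_loop(models, train, extract, evaluate)
print(best_fsc, best_value)
```

In this toy run the first controller is tuned to a single model and performs poorly on the farthest one; once that worst case joins the training set, the controller moves to the midpoint and its worst-case value improves from -10 to -5.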

Figure 2. High-level architecture of the self-interpretable Gumbel softmax network.


This review was generated by AI and reviewed by a human editor.