QUICK REVIEW

[Paper Review] From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism

Sarthak Wanjari|arXiv (Cornell University)|2026. 02. 09.

Reinforcement Learning in Robotics | Citations: 0

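One-Line Summary

Geo-IQL penalizes OOD actions with a precomputed geometric distance penalty, improving stability on fractured data and yielding safer, higher-quality policies on robotics and sepsis-management datasets.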

ABSTRACT

Offline Reinforcement Learning (RL) promises the recovery of optimal policies from static datasets, yet it remains susceptible to the overestimation of out-of-distribution (OOD) actions, particularly in fractured and sparse data manifolds. Current solutions necessitate a trade-off between computational efficiency and performance. Methods like CQL offer rigorous conservatism but require tremendous compute power while efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to Behavioural Cloning. In this work, we propose Geometric Pessimism, a modular, compute-efficient framework that augments standard IQL with density-based penalty derived from k-nearest-neighbour distances in the state-action embedding space. By pre-computing the penalties applied to each state-action pair, our method injects OOD conservatism via reward shaping with a O(1) training overhead to the training loop. Evaluated on the D4RL MuJoCo benchmark, our method, Geo-IQL outperforms standard IQL on sensitive and unstable medium-replay tasks by over 18 points, while reducing inter-seed standard-deviation by 4 times. Furthermore, Geo-IQL does not degrade performance on stable manifolds. Crucially, we validate our algorithm on the MIMIC-III Sepsis critical care dataset. While standard IQL collapses to behaviour cloning, Geo-IQL demonstrates active policy improvement. Maintaining safety constraints, it achieves 86.4% terminal agreement with clinicians compared to IQL's 75%. Our results suggest that geometric pessimism provides the necessary regularisation to safely overcome local optima in critical, real-world decision systems.

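Motivation and Objectives

Motivate safer offline RL in high-stakes domains where data is fractured or sparse.
Propose a compute-efficient way to add a geometric penalty to in-sample learning.
Precompute the penalties so that training overhead stays O(1).
Demonstrate improved stability and policy quality on robotics benchmarks and critical-care data.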

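Proposed Method

Map each state-action pair into a joint embedding space and compute its mean kNN distance as a geometric proxy for uncertainty.
Standardize the distances robustly using the median absolute deviation (MAD), then apply a safety-critical threshold to obtain a graded risk surface U.
Shape the reward with a density-adaptive penalty: r_geo(s,a) = r(s,a) − λ_adapt · max(0, U(s,a)).
Precompute the penalties into a lookup table so that fetching a penalty during training is an O(1) operation.
Fold the penalty into IQL through the reward alone: the IQL objective keeps its standard form, and the critic is trained on the penalized reward.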
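The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the joint space is taken to be a plain concatenation of state and action, the neighbour search is brute force, and `k`, `threshold`, and the constant `lam` (standing in for the density-adaptive λ_adapt, whose exact form is not given in this review) are hypothetical values. The last function shows the standard one-step critic target used by IQL, with the shaped reward substituted for the raw reward.

```python
import numpy as np

def mean_knn_distance(z, k=5):
    """Mean Euclidean distance from each row of z to its k nearest other rows."""
    diff = z[:, None, :] - z[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # (N, N) brute-force pairwise distances
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]    # k smallest distances per row
    return nearest.mean(axis=1)

def risk_surface(d, threshold=2.0):
    """MAD-standardise distances and shift by a threshold (hypothetical value)."""
    med = np.median(d)
    mad = np.median(np.abs(d - med)) + 1e-8   # small constant avoids division by zero
    return (d - med) / mad - threshold        # U > 0 only for unusually isolated points

def shaped_rewards(states, actions, rewards, k=5, lam=1.0, threshold=2.0):
    """Return (r_geo, penalty), both indexed by dataset transition."""
    z = np.concatenate([states, actions], axis=1)   # assumed joint space: [s, a]
    u = risk_surface(mean_knn_distance(z, k), threshold)
    penalty = lam * np.maximum(0.0, u)              # lam is a constant stand-in for lambda_adapt
    return rewards - penalty, penalty

def td_target(r_geo, v_next, done, gamma=0.99):
    """One-step critic target built on the shaped reward."""
    return r_geo + gamma * (1.0 - done) * v_next
```

Because `penalty` is computed once over the whole dataset, a training step only needs `penalty[batch_idx]` (or equivalently `r_geo[batch_idx]`), which is a constant-time array lookup per transition. The brute-force distance matrix costs O(N²) memory, so a tree-based or approximate nearest-neighbour index would be the natural replacement for large datasets.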

Figure 1: 3-Dimensional visualisation of using geometry as a proxy for epistemic uncertainty.

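Experimental Results

Research Questions

RQ1: Can geometric distance to the data manifold serve as a proxy for epistemic uncertainty in offline RL?
RQ2: Does adding a precomputed geometry-based penalty improve IQL performance on fractured data without harming stability?
RQ3: How does Geo-IQL perform against standard IQL and CQL on high-stakes real-world data such as MIMIC-III Sepsis?
RQ4: Can the approach run compute-efficiently on small hardware?
RQ5: What safety and clinician-alignment benefits does geometry-based offline RL bring in healthcare?

Key Results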

Task	BC	CQL	IQL	Geo-IQL
halfcheetah-medium-replay-v2	27.69 ± 10.92	45.41 ± 0.81	43.68 ± 4.15	42.52 ± 3.04
hopper-medium-replay-v2	51.87 ± 20.26	82.60 ± 21.10	80.09 ± 21.80	98.94 ± 5.33
walker2d-medium-replay-v2	43.17 ± 25.77	78.28 ± 18.85	80.17 ± 17.89	82.10 ± 13.39

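Geo-IQL outperforms standard IQL by more than 18 points on hopper-medium-replay-v2 in the D4RL MuJoCo suite (98.94 vs. 80.09).
Geo-IQL reduces the inter-seed standard deviation by roughly 4x on sensitive tasks (21.80 to 5.33 on hopper-medium-replay-v2).
Geo-IQL preserves performance in data-dense regions, matching IQL on stable manifolds.
On MIMIC-III Sepsis, Geo-IQL achieves higher Terminal State Agreement (86.39% vs. 75.02%).
Geo-IQL achieves a positive Q-improvement (ΔQ = +0.0138), whereas IQL shows a negative ΔQ (−0.0169).
Geo-IQL enables targeted policy improvement in critical care while maintaining safety.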
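The headline MuJoCo figures can be checked directly against the results table. The short script below is only an arithmetic check: the (mean, std) pairs are copied from the IQL and Geo-IQL columns of the table, and the `compare` helper is a hypothetical name introduced for this illustration.

```python
# (mean, std) pairs copied from the IQL and Geo-IQL columns of the results table.
results = {
    "halfcheetah-medium-replay-v2": {"IQL": (43.68, 4.15), "Geo-IQL": (42.52, 3.04)},
    "hopper-medium-replay-v2": {"IQL": (80.09, 21.80), "Geo-IQL": (98.94, 5.33)},
    "walker2d-medium-replay-v2": {"IQL": (80.17, 17.89), "Geo-IQL": (82.10, 13.39)},
}

def compare(task):
    """Return (score gain of Geo-IQL over IQL, ratio of IQL std to Geo-IQL std)."""
    iql_mean, iql_std = results[task]["IQL"]
    geo_mean, geo_std = results[task]["Geo-IQL"]
    return geo_mean - iql_mean, iql_std / geo_std

for task in results:
    gain, ratio = compare(task)
    print(f"{task}: gain={gain:+.2f}, std ratio={ratio:.2f}x")
```

On hopper-medium-replay-v2 this gives a gain of 18.85 points and a standard-deviation ratio of about 4.09, which is where the "over 18 points" and "4 times" figures come from. On halfcheetah-medium-replay-v2 Geo-IQL is 1.16 points below IQL, a gap smaller than either method's standard deviation.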

Figure 2: Visualising the Adaptive Safety Mechanism. The blue cloud represents the training data manifold. Green Point: A query inside the dense region. The algorithm detects near-zero distance to neighbours and applies no penalty. Yellow Point: A query slightly off the manifold. This triggers the a


This review was generated by AI and reviewed by a human editor.