QUICK REVIEW

[Paper Review] Reward Prediction with Factorized World States

Yijun Shen, Delong Chen | arXiv (Cornell University) | 2026-03-10

Reinforcement Learning in Robotics | Citations: 0

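One-Line Summary

The paper introduces StateFactory, a semantic factorization method that converts observations into an object-attribute hierarchy and predicts rewards zero-shot, and it presents the RewardPrediction benchmark for evaluating reward quality across domains. In the zero-shot setting, StateFactory achieves 60% lower EPIC distance than VLWM-critic and 8% lower than LLM-as-a-Judge, and the improved rewards translate into better planning performance.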

ABSTRACT

Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the goal is to being reached. Supervised learning of reward models could introduce biases inherent to training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To address this, we introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraint. Overall, the compact representation structure induced by StateFactory enables strong reward generalization capabilities. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. Our method shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models, achieving 60% and 8% lower EPIC distance, respectively. Furthermore, this superior reward quality successfully translates into improved agent planning performance, yielding success rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies and enhancing system-2 agent planning. Project Page: https://statefactory.github.io

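Motivation and Objectives

Enable robust reward prediction for novel goals and environments without task-specific supervision.
Semantically factorize world states into an object-attribute hierarchy using language models.
Estimate rewards as the semantic similarity between the current state and the goal state under a hierarchical constraint.
Evaluate zero-shot reward generalization and its effect on planning performance across diverse domains.
Provide a benchmark (RewardPrediction) that rigorously evaluates step-wise reward quality in text-based environments.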

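Proposed Method

Introduces StateFactory, which decomposes observations into structured object-attribute states and iteratively grounds a dynamic goal state.
Uses a recurrent state extraction function that produces a set of object instances, each with an identity and evolving attributes.
Grounds the goal into a dynamic goal representation through an iterative goal interpretation function.
Computes the reward as the semantic similarity between the current state and the grounded goal state, using hierarchical object-attribute alignment.
Evaluates the agreement between predicted step-wise rewards and ground-truth rewards across five domains using EPIC distance.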
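
The hierarchical reward computation described above can be illustrated with a minimal sketch. It assumes a state is a mapping from object names to attribute dictionaries, and it substitutes a toy token-overlap (Jaccard) score for the embedding-based semantic similarity. The names `text_sim`, `object_sim`, and `reward`, as well as the max-then-average aggregation, are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import Dict

# Assumed representation: object name -> {attribute name: attribute value}
State = Dict[str, Dict[str, str]]


def text_sim(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def object_sim(goal_attrs: Dict[str, str], cur_attrs: Dict[str, str]) -> float:
    """Average, over goal attributes, of how well the current object matches each one."""
    if not goal_attrs:
        return 1.0
    scores = [text_sim(value, cur_attrs.get(name, "")) for name, value in goal_attrs.items()]
    return sum(scores) / len(scores)


def reward(current: State, goal: State) -> float:
    """Hierarchical matching: align objects first, then compare their attributes."""
    if not goal:
        return 0.0
    total = 0.0
    for goal_name, goal_attrs in goal.items():
        best = 0.0
        for cur_name, cur_attrs in current.items():
            # Attribute similarity only counts to the extent the object identities match.
            score = text_sim(goal_name, cur_name) * object_sim(goal_attrs, cur_attrs)
            best = max(best, score)
        total += best
    return total / len(goal)


goal = {"mug": {"location": "in sink", "cleanliness": "clean"}}
before = {"mug": {"location": "on table", "cleanliness": "dirty"}}
after = {"mug": {"location": "in sink", "cleanliness": "dirty"}}
done = {"mug": {"location": "in sink", "cleanliness": "clean"}}

print(reward(before, goal))  # 0.0
print(reward(after, goal))   # 0.5
print(reward(done, goal))    # 1.0
```

In this toy example the reward rises step by step as more goal attributes are satisfied, which is the kind of dense progress signal the method aims for. A real system would replace `text_sim` with an embedding model so that paraphrases still match.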

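Experimental Results

Research Questions

RQ1: Does zero-shot StateFactory provide more accurate reward signals than the baselines?
RQ2: Does StateFactory generalize to unseen domains better than supervised reward models?
RQ3: How does representation granularity (object versus object-attribute) affect performance?
RQ4: How robust is StateFactory to the choice of embedding model, backbone, and reasoning capability?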

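Key Findings

StateFactory achieves a zero-shot average EPIC distance of 0.297, outperforming the best representation-free baseline and approaching the supervised upper bound.
Supervised reward models generalize poorly to new domains, with reward prediction error increasing by 138% on average.
StateFactory's fine-grained object-attribute state representation reduces noise and improves alignment with the goal.
Reward alignment improves as LLM reasoning capability and embedding discriminability increase.
StateFactory's reward signal translates into planning performance: for example, success rate improves by +21.64% on AlfWorld and +12.40% on ScienceWorld (ReAct + StateFactory setting).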
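
To make the EPIC numbers above more concrete, the sketch below computes the Pearson distance, sqrt((1 - rho) / 2), between predicted and ground-truth step-wise rewards. This is the final step of the EPIC distance; the canonicalization step that removes potential-based shaping is omitted here, and how the benchmark aggregates over trajectories is not specified in this review, so treat this as a simplified illustration rather than the benchmark's evaluation code. The function name `pearson_distance` and the sample reward values are made up.

```python
import math
from typing import Sequence


def pearson_distance(pred: Sequence[float], true: Sequence[float]) -> float:
    """Return sqrt((1 - rho) / 2), where rho is the Pearson correlation; range [0, 1]."""
    if len(pred) != len(true) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least two rewards")
    mean_p = sum(pred) / len(pred)
    mean_t = sum(true) / len(true)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_p == 0.0 or var_t == 0.0:
        raise ValueError("correlation is undefined for constant rewards")
    rho = cov / math.sqrt(var_p * var_t)
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    rho = max(-1.0, min(1.0, rho))
    return math.sqrt((1.0 - rho) / 2.0)


# Illustrative values only, not data from the paper.
true_rewards = [0.0, 0.25, 0.5, 1.0]
scaled = [2.0 * r + 1.0 for r in true_rewards]  # positive affine transform of the truth
flipped = [-r for r in true_rewards]            # perfectly anti-correlated prediction

print(pearson_distance(scaled, true_rewards))   # close to 0: same ordering and shape
print(pearson_distance(flipped, true_rewards))  # close to 1: worst case
```

Because the distance depends only on correlation, a reward model is not penalized for using a different scale or offset than the ground truth. That property makes it suitable for comparing reward models whose outputs are not calibrated to the same range.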


This review was generated by AI and reviewed by a human editor.