QUICK REVIEW

[Paper Review] Off-Policy Evaluation via Off-Policy Classification

Alexander Irpan, Kanishka Rao | arXiv (Cornell University) | Jan 1, 2019

Reinforcement Learning in Robotics | 15 citations

TL;DR

This paper proposes a novel off-policy evaluation (OPE) method for deep reinforcement learning in continuous control with sparse binary rewards, framing OPE as a positive-unlabeled (PU) classification problem to avoid reliance on environment models or importance sampling. It demonstrates superior performance in predicting relative policy performance, especially in sim-to-real transfer for image-based robotic manipulation tasks.

ABSTRACT

In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE of value-based methods, which are of particular interest in deep RL with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Furthermore, existing OPE metrics either rely on a model of the environment, or the use of importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real-world of policies trained in simulation for an image-based robotic manipulation task.

Motivation & Objective

To address the high cost and impracticality of on-policy evaluation in real-world deep RL applications.
To develop a reliable off-policy evaluation metric for value-based methods in continuous action spaces with sparse binary rewards.
To eliminate dependence on environment models or importance sampling, which are unstable or infeasible for high-dimensional observations like images.
To enable model selection, hyperparameter tuning, and early stopping from previously collected off-policy data, without costly on-policy rollouts in the real world.
To predict how well policies trained in simulation will perform when transferred to real-world robotic manipulation tasks.

Proposed method

Formulates off-policy evaluation as a positive-unlabeled (PU) classification problem over state-action pairs: a pair is "effective" if an optimal policy could still succeed after taking that action, and "catastrophic" otherwise.
Treats the learned Q-function itself as the classifier score for whether a state-action pair is effective, which fits the paper's focus on value-based methods.
Labels state-action pairs from successful trajectories (final reward 1) as positive; pairs from failed trajectories stay unlabeled, because a failure does not reveal which step caused it.
Avoids importance sampling and environment model fitting by scoring the Q-function directly on logged off-policy data.
Proposes two scores: OPC, a PU estimate of classification error at a fixed threshold, and SoftOPC, a threshold-free variant that uses the Q-values directly.
The final score is used to rank policies against each other; it serves as a proxy for relative performance rather than an estimate of the expected return itself.
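A classification-style score of this kind can be sketched in a few lines. The example below is a minimal sketch rather than the authors' implementation. It assumes the soft variant of the metric (SoftOPC in the paper) takes the form: a prior on the positive class times the mean Q-value over positively labeled state-action pairs, minus the mean Q-value over all pairs. The function names `label_steps` and `soft_opc`, the default prior of 1.0, and the toy Q-values are all illustrative assumptions.

```python
from statistics import fmean
from typing import List, Sequence


def label_steps(episode_lengths: Sequence[int],
                episode_success: Sequence[bool]) -> List[bool]:
    """Broadcast each episode's binary outcome to every step in it.

    True marks a positive example (step from a successful episode);
    False marks an unlabeled example, not a confirmed negative.
    """
    if len(episode_lengths) != len(episode_success):
        raise ValueError("need one success flag per episode")
    labels: List[bool] = []
    for length, success in zip(episode_lengths, episode_success):
        labels.extend([bool(success)] * length)
    return labels


def soft_opc(q_values: Sequence[float],
             is_positive: Sequence[bool],
             positive_prior: float = 1.0) -> float:
    """Assumed SoftOPC form: prior * mean(Q | positive) - mean(Q over all).

    positive_prior is a hyperparameter in this sketch (assumption).
    """
    if len(q_values) != len(is_positive):
        raise ValueError("need one label per Q-value")
    positives = [q for q, pos in zip(q_values, is_positive) if pos]
    if not positives:
        raise ValueError("at least one positive example is required")
    return positive_prior * fmean(positives) - fmean(q_values)


# Toy logged data: one successful 2-step episode, one failed 2-step episode.
labels = label_steps([2, 2], [True, False])

# Hypothetical Q-values from two candidate Q-functions on the same four steps.
q_sharp = [0.9, 0.8, 0.2, 0.1]   # separates successful steps from the rest
q_flat = [0.5, 0.5, 0.5, 0.5]    # assigns every step the same value

score_sharp = soft_opc(q_sharp, labels)
score_flat = soft_opc(q_flat, labels)
```

In this sketch a higher score is read as better: the Q-function that assigns higher values to steps from successful episodes than to the average logged step ranks above the one that cannot tell them apart. No environment model or behavior-policy probabilities appear anywhere in the computation.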

Experimental results

Research questions

RQ1: Can a PU classification-based metric reliably estimate the relative performance of policies without using environment models or importance sampling?
RQ2: How well does the proposed method generalize to sim-to-real transfer in image-based robotic manipulation tasks?
RQ3: Does the method outperform existing OPE baselines in terms of correlation with true on-policy performance in high-dimensional, continuous control settings?
RQ4: How robust is the method to distributional shift and sparse binary rewards in continuous action spaces?
RQ5: Can the method support effective hyperparameter tuning and early stopping in real-world RL deployment pipelines?

Key findings

The proposed PU-based OPE method achieves higher correlation with true on-policy performance than existing baselines, especially in high-dimensional observation settings such as image inputs.
The method successfully predicts relative policy performance across multiple generalization scenarios, including sim-to-real transfer for robotic manipulation tasks.
It outperforms model-based and importance sampling-based OPE methods in terms of stability and accuracy when dealing with continuous action spaces and sparse binary rewards.
The classifier-based approach shows robustness to distributional shift and maintains reliable performance even when behavior policies are significantly different from the target policy.
The method enables effective model selection and early stopping from previously collected off-policy data, reducing the need for costly real-world rollouts.
Empirical results demonstrate that the PU classification metric correlates strongly with actual returns in both tabular and continuous control environments with image observations.
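
Claims about correlation with true on-policy performance can be checked with a rank correlation between each policy's OPE score and its measured success rate. The sketch below assumes Spearman's rank correlation as the agreement measure; the helper names and the four score/success pairs are made up for illustration and are not results from the paper.

```python
from math import sqrt
from statistics import fmean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(scores: Sequence[float], returns: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(scores), average_ranks(returns))


# Hypothetical values for four candidate policies (not from the paper).
ope_scores = [0.35, 0.10, 0.22, 0.05]
true_success = [0.80, 0.40, 0.65, 0.30]

agreement = spearman(ope_scores, true_success)
disagreement = spearman(ope_scores, list(reversed(sorted(true_success))))
```

A value near 1 means the metric orders the policies the same way real rollouts do, which is what matters for model selection and early stopping; the absolute scale of the OPE score is irrelevant to this measure.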

This review was created by AI and reviewed by human editors.