QUICK REVIEW

[Paper Review] Testing of Deep Reinforcement Learning Agents with Surrogate Models

Matteo Biagiola, Paolo Tonella|arXiv (Cornell University)|May 22, 2023

Reinforcement Learning in Robotics|References: 63|Cited by: 7

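One-Sentence Summary

This paper introduces Indago, a search-based method for testing DRL agents. It uses a surrogate model, trained on data from the agent's training interactions, to predict failures and guide the search over environment configurations, and it outperforms state-of-the-art sampling in both the number and the diversity of failures found.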

ABSTRACT

Deep Reinforcement Learning (DRL) has received a lot of attention from the research community in recent years. As the technology moves away from game playing to practical contexts, such as autonomous vehicles and robotics, it is crucial to evaluate the quality of DRL agents. In this paper, we propose a search-based approach to test such agents. Our approach, implemented in a tool called Indago, trains a classifier on failure and non-failure environment (i.e., pass) configurations resulting from the DRL training process. The classifier is used at testing time as a surrogate model for the DRL agent execution in the environment, predicting the extent to which a given environment configuration induces a failure of the DRL agent under test. The failure prediction acts as a fitness function, guiding the generation towards failure environment configurations, while saving computation time by deferring the execution of the DRL agent in the environment to those configurations that are more likely to expose failures. Experimental results show that our search-based approach finds 50% more failures of the DRL agent than state-of-the-art techniques. Moreover, such failures are, on average, 78% more diverse; similarly, the behaviors of the DRL agent induced by failure configurations are 74% more diverse.

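Motivation and Goals

Advance robustness testing of DRL agents deployed in real-world scenarios, beyond game environments.
Use the interaction data collected during training to build a surrogate model of the agent's behavior in the environment.
Develop a search-based method that generates challenging environment configurations able to induce failures of the DRL agent.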

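Proposed Method

Train a surrogate classifier (or regressor) on DRL training interaction data (environment configurations and failure labels).
Use the surrogate model as the fitness function that guides a search-based generator of new environment configurations.
Apply hill climbing or a genetic algorithm to maximize the predicted failure under environment perturbations while preserving validity constraints.
Optionally seed the search with known failure configurations observed during training.
Execute the DRL agent only on the configurations most likely to fail, to save computation.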
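The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Indago's implementation: the two-parameter configuration space, the `[0, 1]` validity constraint, the hand-written logistic-regression surrogate, and the `toy_agent_fails` rule that stands in for running a real DRL episode are all hypothetical.

```python
import math
import random

def predict_failure(weights, bias, config):
    """Surrogate output: estimated probability that `config` makes the agent fail."""
    z = bias + sum(w * x for w, x in zip(weights, config))
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))

def train_surrogate(configs, labels, epochs=200, lr=0.5):
    """Fit a small logistic-regression classifier with plain SGD."""
    weights, bias = [0.0] * len(configs[0]), 0.0
    for _ in range(epochs):
        for config, label in zip(configs, labels):
            error = predict_failure(weights, bias, config) - label
            weights = [w - lr * error * x for w, x in zip(weights, config)]
            bias -= lr * error
    return weights, bias

def is_valid(config):
    """Validity constraint (assumed): every parameter stays in [0, 1]."""
    return all(0.0 <= v <= 1.0 for v in config)

def hill_climb(fitness, start, rng, steps=200, step_size=0.1):
    """Accept a perturbed configuration only if it is valid and scores higher."""
    best, best_fit = list(start), fitness(start)
    for _ in range(steps):
        candidate = [v + rng.uniform(-step_size, step_size) for v in best]
        if is_valid(candidate) and fitness(candidate) > best_fit:
            best, best_fit = candidate, fitness(candidate)
    return best, best_fit

def toy_agent_fails(config):
    """Stand-in for an expensive DRL episode (hypothetical failure rule)."""
    return config[0] + config[1] > 1.2

def search_failures(rng, n_train=200, n_starts=10, threshold=0.5):
    # 1. Labelled configurations, as if logged during DRL training.
    train = [[rng.random(), rng.random()] for _ in range(n_train)]
    labels = [1 if toy_agent_fails(c) else 0 for c in train]
    # 2. The surrogate becomes the fitness function for the search.
    weights, bias = train_surrogate(train, labels)

    def fitness(config):
        return predict_failure(weights, bias, config)

    # 3. Search from random starts; known failures in `train` could seed it instead.
    starts = [[rng.random(), rng.random()] for _ in range(n_starts)]
    found = [hill_climb(fitness, start, rng) for start in starts]
    # 4. Run the (expensive) agent only where the surrogate predicts a failure.
    promising = [c for c, fit in found if fit >= threshold]
    confirmed = [c for c in promising if toy_agent_fails(c)]
    return promising, confirmed

if __name__ == "__main__":
    promising, confirmed = search_failures(random.Random(0))
    print(f"executed {len(promising)} configurations, {len(confirmed)} confirmed failures")
```

The point of the design is that the search loop only calls the cheap surrogate. The real agent runs once per promising configuration, which is where the computational saving comes from.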

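Experimental Results

Research Questions

RQ1: Does surrogate-guided search expose more DRL failures than state-of-the-art sampling?
RQ2: Are the failure configurations found by surrogate-guided search more diverse, both in environment factors and in the DRL behaviors they induce?
RQ3: How do classifiers and regressors compare as surrogate models for guiding the failure search?
RQ4: How does seeding the search with known failure configurations affect its effectiveness?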

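Key Findings

Indago finds about 50% more DRL failures than state-of-the-art sampling.
The failure configurations found by Indago are, on average, about 78% more diverse in their environment settings than those found by sampling.
The DRL agent behaviors induced by Indago's failure configurations are about 74% more diverse.
Computation is saved by executing the DRL agent only on configurations with a high predicted likelihood of failure.
The experimental setup comprises three complex case studies: a parking task, a walking humanoid robot, and a self-driving car.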
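To make a figure such as "78% more diverse" concrete, the sketch below computes a relative gain between two sets of failure configurations. This review does not state how the paper measures diversity, so the mean pairwise Euclidean distance used here is an assumed stand-in, and the two configuration sets are made-up toy values rather than results from the paper.

```python
import math
from itertools import combinations

def mean_pairwise_distance(configs):
    """Assumed diversity proxy: average Euclidean distance over all pairs."""
    pairs = list(combinations(configs, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def relative_gain(value, baseline):
    """Percentage by which `value` exceeds `baseline`."""
    return 100.0 * (value - baseline) / baseline

if __name__ == "__main__":
    # Toy failure configurations (hypothetical, for illustration only).
    sampled = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
    searched = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0)]
    gain = relative_gain(mean_pairwise_distance(searched),
                         mean_pairwise_distance(sampled))
    print(f"searched set is {gain:.0f}% more diverse than the sampled set")
```

The same `relative_gain` arithmetic applies to failure counts: finding 15 failures where a baseline finds 10 is a 50% gain.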


This review was generated by AI and checked by human editors.