QUICK REVIEW

[Paper Review] Testing of Deep Reinforcement Learning Agents with Surrogate Models

Matteo Biagiola, Paolo Tonella|arXiv (Cornell University)|May 22, 2023

Reinforcement Learning in Robotics | References: 63 | Citations: 7

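One-Sentence Summary

This paper proposes Indago, a search-based testing approach for DRL agents. A surrogate model trained on the environment configurations seen during DRL training predicts failures and guides the search for failure-inducing configurations. Indago finds more failures, and more diverse ones, than state-of-the-art techniques.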

ABSTRACT

Deep Reinforcement Learning (DRL) has received a lot of attention from the research community in recent years. As the technology moves away from game playing to practical contexts, such as autonomous vehicles and robotics, it is crucial to evaluate the quality of DRL agents. In this paper, we propose a search-based approach to test such agents. Our approach, implemented in a tool called Indago, trains a classifier on failure and non-failure environment (i.e., pass) configurations resulting from the DRL training process. The classifier is used at testing time as a surrogate model for the DRL agent execution in the environment, predicting the extent to which a given environment configuration induces a failure of the DRL agent under test. The failure prediction acts as a fitness function, guiding the generation towards failure environment configurations, while saving computation time by deferring the execution of the DRL agent in the environment to those configurations that are more likely to expose failures. Experimental results show that our search-based approach finds 50% more failures of the DRL agent than state-of-the-art techniques. Moreover, such failures are, on average, 78% more diverse; similarly, the behaviors of the DRL agent induced by failure configurations are 74% more diverse.

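Motivation and Objectives

Establish the need for robust testing of DRL agents as they move into real-world contexts such as autonomous vehicles and robotics.
Use the data produced during DRL training to build a surrogate model that stands in for executing the agent in the environment.
Develop a search-based approach that generates challenging environment configurations likely to make the DRL agent fail.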

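Proposed Method

Train a surrogate classifier (or regressor) on DRL training data, i.e., environment configurations and their failure labels.
Use the surrogate's failure prediction as the fitness function that guides the search-based generation of new environment configurations.
Apply Hill Climbing or a Genetic Algorithm to search for mutations of the environment configuration that maximize the predicted failure while preserving validity constraints.
Optionally seed the search with known failure configurations observed during training.
To save computation, execute the DRL agent only on the most promising configurations.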
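
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Indago's implementation: the surrogate is a simple k-nearest-neighbours vote rather than the classifier used in the paper, the environment configuration is a toy 2-D point in [0, 1]^2, and `toy_agent_fails`, `is_valid`, `mutate`, the 0.5 execution threshold, and all parameter values are hypothetical names and choices introduced only for this example.

```python
import random


def toy_agent_fails(config):
    """Hypothetical stand-in for running the DRL agent in the environment."""
    return config[0] + config[1] > 1.4


def knn_failure_probability(config, log, k=5):
    """Surrogate (assumed k-NN): fraction of the k nearest logged configs that failed."""
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], config))
    nearest = sorted(log, key=sq_dist)[:k]
    return sum(1 for _, failed in nearest if failed) / len(nearest)


def is_valid(config):
    """Validity constraint (assumed): every parameter stays within [0, 1]."""
    return all(0.0 <= v <= 1.0 for v in config)


def mutate(config, rng, step=0.1):
    """Perturb one randomly chosen parameter by at most `step`."""
    i = rng.randrange(len(config))
    new = list(config)
    new[i] += rng.uniform(-step, step)
    return tuple(new)


def hill_climb(seed, log, rng, iterations=200):
    """Hill Climbing with the surrogate's failure prediction as fitness."""
    best, best_fit = seed, knn_failure_probability(seed, log)
    for _ in range(iterations):
        cand = mutate(best, rng)
        if not is_valid(cand):
            continue  # discard mutations that violate the constraints
        fit = knn_failure_probability(cand, log)
        if fit >= best_fit:  # accept improvements and neutral moves
            best, best_fit = cand, fit
    return best, best_fit


def run_demo(seed_value=0):
    rng = random.Random(seed_value)
    # Hypothetical training log: (configuration, failed?) pairs.
    log = []
    for _ in range(300):
        c = (rng.random(), rng.random())
        log.append((c, toy_agent_fails(c)))
    config, predicted = hill_climb((0.5, 0.5), log, rng)
    # Defer the expensive agent execution to promising configurations only.
    executed = predicted >= 0.5
    failed = toy_agent_fails(config) if executed else None
    return config, predicted, executed, failed


if __name__ == "__main__":
    print(run_demo())
```

Seeding the search with a known failure configuration, as the method allows, would amount to passing a failed configuration from `log` to `hill_climb` instead of the fixed starting point `(0.5, 0.5)`.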

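Experimental Results

Research Questions

RQ1: Does surrogate-guided search expose more DRL agent failures than state-of-the-art sampling?
RQ2: Are the failure configurations found by surrogate-guided search more diverse, both in the environment configurations themselves and in the DRL agent behaviors they induce?
RQ3: Which works better as the surrogate guiding the failure search, a classifier or a regressor?
RQ4: How does seeding the search with known failure configurations affect its effectiveness?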

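Key Findings

Indago finds about 50% more DRL agent failures than state-of-the-art sampling.
The failure configurations Indago finds are, on average, about 78% more diverse than those found by sampling.
The DRL agent behaviors induced by those failure configurations are about 74% more diverse.
Running the DRL agent only on configurations with a high predicted failure saves computation.
The evaluation covers three complex case studies: parking, a walking humanoid, and a self-driving car.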

This review was generated by AI and checked by a human editor.