QUICK REVIEW

[Paper Review] Boosting deep Reinforcement Learning using pretraining with Logical Options

Zihan Ye, Phil Chau|arXiv (Cornell University)|Mar 6, 2026

Reinforcement Learning in Robotics | Citations: 0

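One-line summary

H^2RL integrates differentiable symbolic reasoning into pretraining. Its logic-informed pretraining steers deep RL toward long-horizon, goal-directed behavior and improves performance without requiring symbolic reasoning at inference time.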

ABSTRACT

Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans' ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H^2RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.

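Motivation and objectives

Address policy misalignment in deep RL that arises from reward hacking and the exploitation of short-term rewards.
Propose a two-stage Hybrid Hierarchical RL framework built on symbolic logic and option workers.
Keep inference efficient while still allowing the final policy to be fine-tuned through standard environment interaction.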

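Proposed method

Inject logic priors during pretraining using a differentiable logic manager and a gating module.
Pretrain with fixed, already-trained option workers that execute subtasks guided by the symbolic state z_t.
Train a neural policy in parallel and combine the outputs through a Mixture-of-Experts gating module.
Improve performance further through post-training with environment interaction (H^2RL++).
Optimize a PPO-based objective for the hybrid policy, whose value function includes both logic and neural critics.
Include entropy regularization for exploration and a gating entropy term that encourages a balance between logic and neural control.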
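
The gating and objective described in the list above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names (`mix_policies`, `hybrid_ppo_loss`), the two-way softmax gate, the coefficient values, and the toy inputs are all hypothetical, and the value loss with its logic and neural critics is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a categorical distribution."""
    return -np.sum(p * np.log(p + eps), axis=axis)


def mix_policies(logic_probs, neural_logits, gate_logits):
    """Blend a logic expert and a neural expert with a two-way softmax gate.

    Assumption: the gate outputs one weight per expert, [w_logic, w_neural].
    """
    neural_probs = softmax(neural_logits)
    gate = softmax(gate_logits)  # shape (batch, 2)
    mixed = gate[:, :1] * logic_probs + gate[:, 1:] * neural_probs
    return mixed, gate


def hybrid_ppo_loss(mixed_probs, old_probs, actions, advantages, gate,
                    clip_eps=0.2, ent_coef=0.01, gate_ent_coef=0.01):
    """Clipped PPO surrogate plus policy-entropy and gating-entropy bonuses.

    The coefficients are placeholder values, not taken from the paper.
    """
    idx = np.arange(len(actions))
    ratio = mixed_probs[idx, actions] / old_probs[idx, actions]
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages).mean()
    policy_entropy = entropy(mixed_probs).mean()  # encourages exploration
    gate_entropy = entropy(gate).mean()  # discourages collapsing onto one expert
    # Return a loss to minimize, so the maximized terms are negated.
    return -(surrogate + ent_coef * policy_entropy + gate_ent_coef * gate_entropy)


# Toy usage: one state, three actions, an undecided gate.
logic_probs = np.array([[0.9, 0.05, 0.05]])  # logic expert prefers action 0
neural_logits = np.zeros((1, 3))             # neural expert is uniform
gate_logits = np.zeros((1, 2))               # equal weight on both experts
mixed, gate = mix_policies(logic_probs, neural_logits, gate_logits)

actions = np.array([0])
advantages = np.array([2.0])
loss = hybrid_ppo_loss(mixed, mixed.copy(), actions, advantages, gate)
print(mixed, gate, loss)
```

In this sketch the mixed policy is a convex combination of the two experts, so it stays a valid probability distribution. The gating entropy bonus is largest when the gate weights both experts equally, which is one plausible reading of "balancing logic and neural control".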

Figure 1: Deep reinforcement learning policies are often misaligned, exemplified on neural PPO agents. Although the oxygen is running low in Seaquest (left) and the goal in Kangaroo (right) is to go up, PPO agent fails to choose the optimal actions (in green). Instead, they focus on immediate rewa

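Experimental results

Research questions

RQ1: How well does H^2RL perform against baselines on challenging long-horizon RL tasks?
RQ2: Can H^2RL pretraining boost other deep RL methods, both on-policy and off-policy?
RQ3: Can logic-informed pretraining mitigate policy misalignment and reward traps?
RQ4: How much does each component of H^2RL (logic, pretrained options, gating) contribute to performance?
RQ5: Does H^2RL scale to continuous action spaces?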

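Key findings

H^2RL variants substantially outperform baselines on challenging Atari tasks that involve long-horizon dependencies and reward traps.
Logic-informed pretraining serves as a general foundation that boosts both on-policy and off-policy methods.
Pretraining mitigates policy misalignment, enabling agents to climb the floors in Kangaroo where the base agents fail.
The synergy between logic guidance and neural flexibility is essential: neither the logic-only nor the neural-only configuration outperforms H^2RL.
H^2RL also improves performance in the continuous action spaces of CALE, outperforming PPO and hierarchical baselines.
Post-training (H^2RL++) yields substantially larger gains than pretraining alone, supporting the two-stage approach.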

Figure 2: Overview of the framework. Through logic-informed pretraining, H^2RL embeds logic priors directly into neural policies, thereby addressing the deep policy misalignment issue. H^2RL provides a two-stage training paradigm. In the first stage, the deep policy is jointly trained with the lo

This review was generated by AI and checked by a human editor.