QUICK REVIEW

[Paper Review] Deep Intrinsic Surprise-Regularized Control (DISRC): A Biologically Inspired Mechanism for Efficient Deep Q-Learning in Sparse Environments

Yash Kini, Shiv Davay|arXiv (Cornell University)|Jan 24, 2026

Reinforcement Learning in Robotics | Cited by 0

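One-line summary

DISRC uses a latent-space surprise signal to dynamically scale Q-updates, improving learning efficiency and stability in sparse-reward environments. On MiniGrid tasks it shows faster early learning and more stable rewards than a vanilla DQN.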

ABSTRACT

Deep reinforcement learning (DRL) has driven major advances in autonomous control. Still, standard Deep Q-Network (DQN) agents tend to rely on fixed learning rates and uniform update scaling, even as updates are modulated by temporal-difference (TD) error. This rigidity destabilizes convergence, especially in sparse-reward settings where feedback is infrequent. We introduce Deep Intrinsic Surprise-Regularized Control (DISRC), a biologically inspired augmentation to DQN that dynamically scales Q-updates based on latent-space surprise. DISRC encodes states via a LayerNorm-based encoder and computes a deviation-based surprise score relative to a moving latent setpoint. Each update is then scaled in proportion to both TD error and surprise intensity, promoting plasticity during early exploration and stability as familiarity increases. We evaluate DISRC on two sparse-reward MiniGrid environments, which included MiniGrid-DoorKey-8x8 and MiniGrid-LavaCrossingS9N1, under identical settings as a vanilla DQN baseline. In DoorKey, DISRC reached the first successful episode (reward > 0.8) 33% faster than the vanilla DQN baseline (79 vs. 118 episodes), with lower reward standard deviation (0.25 vs. 0.34) and higher reward area under the curve (AUC: 596.42 vs. 534.90). These metrics reflect faster, more consistent learning - critical for sparse, delayed reward settings. In LavaCrossing, DISRC achieved a higher final reward (0.95 vs. 0.93) and the highest AUC of all agents (957.04), though it converged more gradually. These preliminary results establish DISRC as a novel mechanism for regulating learning intensity in off-policy agents, improving both efficiency and stability in sparse-reward domains. By treating surprise as an intrinsic learning signal, DISRC enables agents to modulate updates based on expectation violations, enhancing decision quality when conventional value-based methods fall short.

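Motivation and objectives

Improve the sample efficiency and stability of deep Q-learning under sparse rewards.
Introduce a biologically inspired mechanism that adjusts update magnitude based on intrinsic surprise.
Compare DISRC with a vanilla DQN on sparse-reward MiniGrid tasks and quantify the gains in learning speed and stability.
Show how the latent-space deviation from a moving setpoint modulates learning dynamics.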

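Proposed method

Introduce a LayerNorm-based encoder that maps observations to a 64-dimensional latent space.
Compute a latent-space surprise score from the deviation relative to a moving latent setpoint.
Scale each Q-update by both the TD error and the surprise intensity.
Adjust the extrinsic reward with a surprise-based term that influences the learning update.
Train within a standard DQN framework with the DISRC components integrated (including experience replay and soft target updates).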
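
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The exponential-moving-average setpoint, the Euclidean-distance surprise score, the `1 + beta * surprise` scaling rule, and all names and default values (`SurpriseTracker`, `momentum`, `beta`, `lr`) are assumptions made for illustration. Only the 64-dimensional latent size and the use of LayerNorm come from the review.

```python
import numpy as np

LATENT_DIM = 64  # latent size stated in the review


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def encode(obs, weights):
    """Linear projection followed by LayerNorm; a stand-in for the real encoder."""
    return layer_norm(weights @ obs)


class SurpriseTracker:
    """Tracks a moving latent setpoint and scores deviation from it."""

    def __init__(self, dim=LATENT_DIM, momentum=0.99):
        self.setpoint = np.zeros(dim)
        self.momentum = momentum  # hypothetical EMA coefficient

    def score(self, z):
        # Assumed form: surprise is the Euclidean distance to the setpoint.
        surprise = float(np.linalg.norm(z - self.setpoint))
        # Move the setpoint toward the latent after scoring it.
        self.setpoint = self.momentum * self.setpoint + (1.0 - self.momentum) * z
        return surprise


def scaled_td_update(q, target, surprise, lr=0.1, beta=0.5):
    """Move q toward target, with the step scaled by TD error and surprise."""
    td_error = target - q
    scale = 1.0 + beta * surprise  # hypothetical scaling rule
    return q + lr * scale * td_error
```

Under these assumptions, repeated visits to the same latent pull the setpoint toward it, so the surprise score and the effective step size shrink as a state becomes familiar. With a large `beta` or `lr` the scaled step could overshoot the target, so a real implementation would likely bound the scale.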

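Experimental results

Research questions

RQ1: Does DISRC improve sample efficiency over a vanilla DQN in sparse-reward environments?
RQ2: Does latent-space surprise modulation lead to more stable learning and lower reward variance?
RQ3: How does DISRC affect convergence speed and final performance on MiniGrid tasks?
RQ4: What trade-offs and computational considerations come with introducing an intrinsic surprise signal?
RQ5: Does DISRC generalize across different sparse-reward scenarios within the MiniGrid benchmark?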

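Key findings

On MiniGrid-DoorKey-8x8, DISRC reached its first successful episode after 79 episodes, versus 118 for DQN (33% faster).
On DoorKey, the reward standard deviation was lower for DISRC (0.25) than for DQN (0.34), indicating more stable learning.
On DoorKey, the AUC for DISRC (596.42) exceeded that of DQN (534.90).
On MiniGrid-LavaCrossingS9N1, the final mean reward was higher for DISRC (0.95) than for DQN (0.93).
On LavaCrossing, DISRC had the highest AUC (957.04) versus 934.82 for DQN, although it converged more gradually.
Across both environments, DISRC showed stronger long-term generalization and more stable learning curves.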
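
As a quick check of the arithmetic behind the DoorKey figures, the short sketch below recomputes the relative improvements from the reported numbers. The helper name `relative_change` is hypothetical. The input values are the ones quoted in the review.

```python
def relative_change(baseline, value):
    """Signed change of value relative to baseline, as a fraction of baseline."""
    return (value - baseline) / baseline


# DoorKey: episodes to first success (fewer is better), so the saving is the negated change.
episodes_saved = -relative_change(118, 79)
print(f"{episodes_saved:.1%}")  # 33.1%

# DoorKey: reward AUC (higher is better).
auc_gain = relative_change(534.90, 596.42)
```

The first figure matches the reported 33% speedup. The same helper gives an AUC gain of about 11.5% for DoorKey.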

This review was generated by AI and checked by a human editor.