QUICK REVIEW

[Paper Review] Deep Intrinsic Surprise-Regularized Control (DISRC): A Biologically Inspired Mechanism for Efficient Deep Q-Learning in Sparse Environments

Yash Kini, Shiv Davay|arXiv (Cornell University)|Jan 24, 2026
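Reinforcement Learning in Robotics | Cited by 0
One-Sentence Summary

DISRC dynamically scales Q-updates using a surprise signal computed in latent space, improving learning efficiency and stability in sparse-reward environments. On the MiniGrid DoorKey task it shows faster early convergence and more consistent rewards than vanilla DQN.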

ABSTRACT

Deep reinforcement learning (DRL) has driven major advances in autonomous control. Still, standard Deep Q-Network (DQN) agents tend to rely on fixed learning rates and uniform update scaling, even as updates are modulated by temporal-difference (TD) error. This rigidity destabilizes convergence, especially in sparse-reward settings where feedback is infrequent. We introduce Deep Intrinsic Surprise-Regularized Control (DISRC), a biologically inspired augmentation to DQN that dynamically scales Q-updates based on latent-space surprise. DISRC encodes states via a LayerNorm-based encoder and computes a deviation-based surprise score relative to a moving latent setpoint. Each update is then scaled in proportion to both TD error and surprise intensity, promoting plasticity during early exploration and stability as familiarity increases. We evaluate DISRC on two sparse-reward MiniGrid environments, which included MiniGrid-DoorKey-8x8 and MiniGrid-LavaCrossingS9N1, under identical settings as a vanilla DQN baseline. In DoorKey, DISRC reached the first successful episode (reward > 0.8) 33% faster than the vanilla DQN baseline (79 vs. 118 episodes), with lower reward standard deviation (0.25 vs. 0.34) and higher reward area under the curve (AUC: 596.42 vs. 534.90). These metrics reflect faster, more consistent learning - critical for sparse, delayed reward settings. In LavaCrossing, DISRC achieved a higher final reward (0.95 vs. 0.93) and the highest AUC of all agents (957.04), though it converged more gradually. These preliminary results establish DISRC as a novel mechanism for regulating learning intensity in off-policy agents, improving both efficiency and stability in sparse-reward domains. By treating surprise as an intrinsic learning signal, DISRC enables agents to modulate updates based on expectation violations, enhancing decision quality when conventional value-based methods fall short.

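Research Motivation and Objectives

  • Improve the sample efficiency and stability of deep Q-learning in sparse-reward environments.
  • Introduce a biologically inspired mechanism that regulates update magnitude based on intrinsic surprise.
  • Compare DISRC against vanilla DQN on sparse-reward MiniGrid tasks and quantify the gains in learning speed and stability.
  • Demonstrate how deviation from a moving setpoint in latent space modulates learning dynamics.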

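Proposed Method

  • A LayerNorm-based encoder maps observations to a 64-dimensional latent space.
  • A latent-space surprise score is computed from the deviation relative to a moving latent setpoint.
  • Each Q-update is scaled by both the TD error and the surprise intensity.
  • A surprise-based term modulates the extrinsic reward to influence learning updates.
  • Training follows the standard DQN framework with the DISRC components integrated, including experience replay and soft target updates.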
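
The summary does not give the paper's exact formulas, so the following is only a minimal sketch of how the steps listed above might fit together. Everything specific here is an assumption made for illustration: the `SurpriseTracker` helper, the exponential-moving-average setpoint, the Euclidean distance used as the surprise score, the `1 + beta * tanh(surprise)` gain, and all hyperparameter values. It uses a 4-dimensional vector and a scalar Q-value in place of the 64-dimensional encoder and the Q-network, and it leaves out the reward-modulation term.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class SurpriseTracker:
    """Hypothetical helper: keeps a moving latent setpoint and scores deviation from it."""

    def __init__(self, dim, momentum=0.9):
        self.setpoint = [0.0] * dim   # assumed initial setpoint: the origin
        self.momentum = momentum      # assumed EMA coefficient

    def surprise(self, z):
        # Assumed score: Euclidean distance to the setpoint, divided by sqrt(dim).
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, self.setpoint)))
        return dist / math.sqrt(len(z))

    def update(self, z):
        # Move the setpoint toward the latest latent vector.
        m = self.momentum
        self.setpoint = [m * s + (1.0 - m) * a for s, a in zip(self.setpoint, z)]


def scaled_q_update(q, td_target, surprise, lr=0.1, beta=1.0):
    """One scalar Q-update whose step grows with surprise (assumed gain: 1 + beta*tanh)."""
    td_error = td_target - q
    gain = 1.0 + beta * math.tanh(surprise)  # stays in [1, 1 + beta] for surprise >= 0
    return q + lr * gain * td_error


tracker = SurpriseTracker(dim=4)
z = layer_norm([1.0, 2.0, 3.0, 4.0])   # stand-in for an encoder output
novel = tracker.surprise(z)             # high: this latent is far from the setpoint
for _ in range(50):
    tracker.update(z)                    # the same state keeps recurring
familiar = tracker.surprise(z)          # low: the setpoint has caught up
print(scaled_q_update(0.0, 1.0, novel), scaled_q_update(0.0, 1.0, familiar))
```

With these assumed choices, the same TD error produces a larger step while a state is novel and a step close to the plain learning rate once the setpoint has caught up, which is the plasticity-then-stability behavior the abstract describes.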

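Experimental Results

Research Questions

  • RQ1: Does DISRC improve sample efficiency over vanilla DQN in sparse-reward environments?
  • RQ2: Does latent-space surprise modulation lead to more stable learning and lower reward variance?
  • RQ3: How does DISRC affect convergence speed and final performance on MiniGrid tasks?
  • RQ4: What are the trade-offs and computational considerations of introducing an intrinsic surprise signal?
  • RQ5: Does DISRC generalize across different sparse-reward scenarios within the MiniGrid benchmark?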

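Key Findings

  • In MiniGrid-DoorKey-8x8, DISRC reached its first successful episode after 79 episodes, versus 118 for DQN (33% faster).
  • DISRC's reward standard deviation on DoorKey was 0.25, lower than DQN's 0.34.
  • DISRC's AUC on DoorKey was 596.42, higher than DQN's 534.90.
  • In MiniGrid-LavaCrossingS9N1, DISRC's final average reward was 0.95, higher than DQN's 0.93.
  • In LavaCrossing, DISRC achieved the highest AUC (957.04, versus 934.82 for DQN), although it converged more slowly.
  • DISRC showed stronger long-term generalization and more stable learning curves in both environments.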
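
The metrics reported above can be computed from a per-episode reward log along the following lines. This is a sketch under stated assumptions rather than the paper's evaluation code: the function names are hypothetical, episodes are counted from 1, the standard deviation is the population form, and the AUC uses the trapezoidal rule with unit spacing (the summary does not say which variants the authors used). The last line checks the reported 33% figure from the two episode counts.

```python
import statistics


def first_success_episode(rewards, threshold=0.8):
    """Return the 1-indexed episode of the first reward above threshold, or None."""
    for episode, reward in enumerate(rewards, start=1):
        if reward > threshold:
            return episode
    return None


def reward_std(rewards):
    """Population standard deviation of per-episode rewards (assumed variant)."""
    return statistics.pstdev(rewards)


def reward_auc(rewards):
    """Trapezoidal area under the reward curve with unit episode spacing (assumed variant)."""
    return sum((a + b) / 2 for a, b in zip(rewards, rewards[1:]))


def relative_speedup(baseline_episodes, new_episodes):
    """Fraction of episodes saved relative to the baseline."""
    return (baseline_episodes - new_episodes) / baseline_episodes


# Episode counts reported for DoorKey: 118 for DQN, 79 for DISRC.
print(round(100 * relative_speedup(118, 79)))  # 33
```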

This review was generated by AI and reviewed by a human editor.