QUICK REVIEW

[Paper Review] Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

Jello Zhou, Vudtiwat Ngampruetikorn|arXiv (Cornell University)|Mar 17, 2026

Diffusion and Search Dynamics | Cited by 0

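One-sentence summary

This paper shows that stochastic resetting accelerates policy convergence in both tabular and deep reinforcement learning: by truncating uninformative trajectories and improving value propagation, it speeds up learning while preserving the optimal policy.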

ABSTRACT

Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.

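Motivation and objectives

Investigate how stochastic resetting interacts with reinforcement learning, where the dynamics adapt through experience, in contrast to the static, non-learning processes treated by existing theory.
Determine whether resetting accelerates policy convergence beyond any improvement in search efficiency.
Separate the roles of the resetting mechanism and the discount factor in shaping learning dynamics and the final policy.
Evaluate the effect of resetting on discrete and continuous tasks, in both tabular and deep RL settings.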

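Proposed method

Implement stochastic resetting by returning the agent to a fixed start state with probability r at each training step; the value function is not updated on reset transitions.
Analyze three environments, GridWorld and WindyCliff (tabular Q-learning) and MountainCar (DQN), to study the effect on learning and on the resulting policy.
Compare resetting with the discount factor gamma to separate effects on the policy from effects on convergence speed.
Evaluate with two metrics: sample efficiency measured in training steps, and final policy performance.
Retain accumulated knowledge across resets, so that resetting changes the trajectory distribution rather than the learned value function.
Provide full algorithmic details and hyperparameters in the Materials and Methods section for reproducibility.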
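
The procedure described above can be sketched in a few lines of tabular Q-learning. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid layout, the reward of 1 at the goal, the epsilon-greedy exploration, the decision to count a reset as one training step, and every hyperparameter value are hypothetical choices made for the example. The function name `q_learning_with_resetting` is likewise invented here.

```python
import random


def q_learning_with_resetting(size=5, reset_prob=0.05, episodes=200,
                              alpha=0.5, gamma=0.9, epsilon=0.1,
                              max_steps=10_000, seed=0):
    """Tabular Q-learning on a size x size grid with stochastic resetting.

    Assumed setup (not from the paper): start at (0, 0), goal at the
    opposite corner, reward 1 on reaching the goal and 0 otherwise.
    """
    rng = random.Random(seed)
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    start, goal = (0, 0), (size - 1, size - 1)
    q = {((x, y), a): 0.0
         for x in range(size) for y in range(size) for a in range(4)}
    total_steps = 0

    for _ in range(episodes):
        state = start
        for _ in range(max_steps):
            total_steps += 1
            # Stochastic reset: with probability reset_prob, jump back to
            # the start state. No Q-update is made on this transition,
            # so learned values are carried across resets.
            if rng.random() < reset_prob:
                state = start
                continue
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                action = rng.randrange(4)
            else:
                best = max(q[(state, b)] for b in range(4))
                action = rng.choice(
                    [b for b in range(4) if q[(state, b)] == best])
            dx, dy = moves[action]
            nxt = (min(max(state[0] + dx, 0), size - 1),
                   min(max(state[1] + dy, 0), size - 1))
            done = nxt == goal
            reward = 1.0 if done else 0.0
            # Standard Q-learning target; the goal is terminal.
            target = reward if done else gamma * max(
                q[(nxt, b)] for b in range(4))
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q, total_steps
```

Calling the function with several values of `reset_prob` (including 0.0) and comparing the returned `total_steps` gives a rough, single-seed view of sample efficiency; any serious comparison would average over many seeds.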

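Experimental results

Research questions

RQ1: Does stochastic resetting accelerate policy convergence in reinforcement learning beyond what improved search efficiency alone would explain?
RQ2: How do resetting and the discount factor (gamma) jointly affect learning dynamics and the final policy?
RQ3: Is resetting beneficial in deep RL when exploration is difficult and rewards are sparse?
RQ4: In continuous-state tasks, does resetting change the optimal policy relative to standard RL dynamics?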

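Key findings

In GridWorld, resetting accelerates policy convergence even on smaller grids, where it reduces search efficiency.
By truncating long exploratory trajectories, resetting speeds up the backward propagation of value information; it changes the learning process without changing the optimal policy.
In WindyCliff, resetting changes the speed of convergence but not the learned policy, a mechanism distinct from that of the discount factor.
In MountainCar with DQN, a moderate reset rate improves learning under sparse rewards by increasing the chance of reaching the goal, whereas an excessively high reset rate is detrimental.
The benefit of resetting is largest when long, uninformative trajectories are the main bottleneck, and it depends on exploration difficulty and reward structure.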
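
A toy calculation can make the contrast with discounting concrete. The sketch below is not taken from the paper: the two paths, their rewards and lengths, and the convention of discounting a terminal reward by `gamma ** steps` are all hypothetical. It only illustrates the general fact that the discount factor enters the return being optimized and can therefore change which policy is optimal, whereas resetting as summarized above triggers no value update and so leaves that objective untouched.

```python
def discounted_return(reward, steps, gamma):
    """Discounted value of a single terminal reward received after `steps` steps."""
    return gamma ** steps * reward


def preferred_path(gamma, short=(1.0, 2), long=(2.0, 6)):
    """Return which hypothetical path has the higher discounted return.

    Each path is a (reward, steps) pair: the short path pays less but
    sooner, the long path pays more but later.
    """
    v_short = discounted_return(short[0], short[1], gamma)
    v_long = discounted_return(long[0], long[1], gamma)
    return "short" if v_short > v_long else "long"


if __name__ == "__main__":
    for gamma in (1.0, 0.9, 0.8):
        print(gamma, preferred_path(gamma))
```

With these illustrative numbers the preference flips once gamma drops below 0.5 ** 0.25 (about 0.84), so lowering gamma to speed up learning can come at the cost of a different final policy.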


This review was generated by AI and reviewed by a human editor.