QUICK REVIEW

[Paper Review] Regularized Gradient Temporal-Difference Learning

Hyunjun Na, Donghwan Lee|arXiv (Cornell University)|Jan 28, 2026

Reinforcement Learning in Robotics|Cited by 0

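One-Sentence Summary

This paper proposes regularized GTD (R-GTD), a regularized saddle-point formulation that guarantees convergence even when the feature interaction matrix (FIM) is singular, and provides explicit error bounds together with empirical validation.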

ABSTRACT

Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular and lead to instability or degraded performance. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization. This formulation naturally yields a regularized GTD algorithm, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.

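Motivation and Objectives

Ensure the stability and convergence of GTD-family methods when the feature interaction matrix (FIM) is singular.
Introduce a regularized MSPBE-based objective that yields a well-defined saddle-point problem.
Provide theoretical guarantees (convergence and error bounds) for R-GTD in both the singular and nonsingular cases.
Demonstrate the empirical robustness of R-GTD when the FIM is singular, in comparison with GTD2.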
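
For orientation, the unregularized MSPBE objective and its saddle-point form can be written as below. This is the textbook formulation behind GTD2, not the paper's regularized objective, which this summary does not reproduce. The symbols $\Phi$, $D$, $P^\pi$, $R^\pi$, $A$, $b$, and $C$ are assumed notation, and whether $A$ matches the paper's FIM exactly (for example in sign convention) is also an assumption.

```latex
% Assumed notation: \Phi feature matrix, D diagonal state-distribution matrix,
% P^\pi and R^\pi target-policy transition matrix and expected reward vector.
\begin{align*}
A &= \Phi^\top D\,(I - \gamma P^\pi)\,\Phi, \qquad
b = \Phi^\top D R^\pi, \qquad
C = \Phi^\top D\,\Phi, \\
\mathrm{MSPBE}(\theta) &= \tfrac{1}{2}\,(b - A\theta)^\top C^{-1} (b - A\theta), \\
\min_\theta \mathrm{MSPBE}(\theta)
  &= \min_\theta \max_w \Big[\, w^\top (b - A\theta) - \tfrac{1}{2}\, w^\top C\, w \,\Big].
\end{align*}
```

The inner maximum is attained at $w = C^{-1}(b - A\theta)$, which recovers the quadratic form. If $A$ is singular, adding any vector from its null space to $\theta$ leaves the objective unchanged, so the minimizer is no longer unique.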

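Proposed Method

Regularize the MSPBE by adding a quadratic term and introduce a slack variable w into the constraint, forming a regularized min-max problem.
Derive the closed-form optimal solution and show how R-GTD reduces to GTD2 as the regularization parameter c grows.
Propose primal-dual gradient dynamics (PDGD) updates and a stochastic variant for off-policy data (using importance sampling).
Establish convergence of the continuous-time PDGD using existing PDGD results, and prove convergence of the discrete-time algorithm via the ordinary differential equation (ODE) method.
Give explicit update rules for θ, w, and λ, forming the R-GTD algorithm (Algorithm 1).
Prove that the R-GTD solution converges to the GTD2 solution as c→∞ when the FIM is nonsingular, and that R-GTD remains well-defined when the FIM is singular.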
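
As a point of reference for the stochastic primal-dual updates mentioned above, here is a minimal sketch of the GTD2 baseline with an importance-sampling ratio. It is one common form of GTD2, not the paper's Algorithm 1: the R-GTD rules for θ, w, and λ are not given in this summary and are not reconstructed here. The function name `gtd2_step`, the toy two-state chain, and the step sizes are all hypothetical.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One GTD2 update (a common importance-weighted form, assumed here).

    theta: value weights, w: auxiliary (dual) weights,
    rho: importance ratio pi(a|s) / mu(a|s) for the sampled action.
    """
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)  # TD error
    # Primal step on theta uses the current (old) w.
    theta_new = theta + alpha * rho * (phi - gamma * phi_next) * (phi @ w)
    # Dual step on w tracks the projected TD error.
    w_new = w + beta * (rho * delta - (phi @ w)) * phi
    return theta_new, w_new

# Hypothetical toy chain: state 0 -> state 1 (reward 1), state 1 -> state 0 (reward 0).
features = np.eye(2)           # tabular features, so the toy problem is nonsingular
rewards = np.array([1.0, 0.0])
gamma, alpha, beta = 0.9, 0.05, 0.05

theta = np.zeros(2)
w = np.zeros(2)
state = 0
for _ in range(5000):
    next_state = 1 - state
    theta, w = gtd2_step(theta, w, features[state], features[next_state],
                         rewards[state], rho=1.0, gamma=gamma,
                         alpha=alpha, beta=beta)
    state = next_state

print("theta after training:", theta)
```

With `rho=1.0` the sketch runs on-policy. In an off-policy setting the ratio would be computed from the target and behavior policies for each sampled action.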

Figure 1 : As $c\to\infty$ , the R-GTD solution $\theta_{\mathrm{RGTD}}$ converges to the GTD2 solution $\theta_{\mathrm{GTD2}}$ . $\theta_{\mathrm{GTD2}}$ decomposes uniquely into two components: $v\in\mathrm{Null}(G)$ along the null space of $G$ , and $v_{\perp}\in\mathrm{Null}(G)^{\perp}$ orthogo

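Experimental Results

Research Questions

RQ1: Can regularization remove the FIM-nonsingularity assumption in GTD2?
RQ2: Under a singular FIM, can the regularized formulation provide convergence guarantees and approximate finite-sample error bounds?
RQ3: How does the R-GTD solution relate to the true projected solution, and what is the effect of the regularization parameter c?
RQ4: When R-GTD is used in place of GTD2, does off-policy evaluation with function approximation remain stable under a singular FIM?
RQ5: What are the theoretical and empirical implications of introducing the slack variable w and the c-regularization term in practice?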

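Key Findings

R-GTD converges to a unique saddle point without requiring the FIM to be nonsingular.
R-GTD introduces an explicit bias term that vanishes as c grows, recovering GTD2 in the nonsingular case.
As c increases, the R-GTD solution approaches the GTD2 solution or, when the FIM is singular, a projection within the set of GTD2 solutions.
The theoretical results include convergence guarantees for the continuous-time PDGD and explicit error bounds relative to the true projected solution.
Empirical results show that, when the FIM is singular, R-GTD converges stably whereas GTD2 is unstable.
The unconstrained reformulation (Problem 6) aids the stability analysis and connects to MSPBE regularization.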
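
To make the singular setting concrete, the sketch below builds a matrix of the form Φᵀ D (I − γP) Φ and checks its rank. Treating this matrix as the FIM is an assumption, since the summary does not define it. The transition matrix `P`, the distribution `d`, and both feature matrices are hypothetical. Redundant features are only one way to obtain a singular matrix.

```python
import numpy as np

def feature_interaction_matrix(Phi, d, P, gamma):
    """Phi^T D (I - gamma P) Phi, used here as an assumed stand-in for the FIM."""
    D = np.diag(d)
    return Phi.T @ D @ (np.eye(len(d)) - gamma * P) @ Phi

# Hypothetical 3-state chain; P is doubly stochastic, so uniform d is stationary.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
d = np.full(3, 1.0 / 3.0)
gamma = 0.9

# Linearly independent features versus a redundant pair (second column = 2 * first).
Phi_independent = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
Phi_redundant = np.array([[1.0, 2.0],
                          [0.0, 0.0],
                          [1.0, 2.0]])

A_independent = feature_interaction_matrix(Phi_independent, d, P, gamma)
A_redundant = feature_interaction_matrix(Phi_redundant, d, P, gamma)

rank_independent = np.linalg.matrix_rank(A_independent, tol=1e-10)
rank_redundant = np.linalg.matrix_rank(A_redundant, tol=1e-10)
print("rank with independent features:", rank_independent)
print("rank with redundant features:  ", rank_redundant)
```

A rank below the number of features means the matrix is singular, which is the case where analyses that assume nonsingularity no longer apply.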

Figure 2 : Solution trajectory of the closed-form $\theta_{\mathrm{RGTD}}$ in a two-dimensional singular toy example. As the regularization parameter $c$ increases, $\theta_{\mathrm{RGTD}}$ converges to $\theta_{\mathrm{GTD2}}$ .
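
The limiting behavior in this caption can be illustrated with a generic stand-in. The sketch below is not the paper's closed form. It assumes a plain Tikhonov (ridge) penalty of weight 1/c on a singular linear system, with a hypothetical 2×2 matrix `A` and vector `b`. As c grows, the regularized solution approaches the minimum-norm solution, which has no component in the null space of `A`.

```python
import numpy as np

def ridge_solution(A, b, c):
    """Closed-form minimizer of ||A theta - b||^2 + (1/c) ||theta||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + np.eye(n) / c, A.T @ b)

# Hypothetical singular system: both rows are identical, so rank(A) = 1.
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([2.0, 2.0])

# Minimum-norm solution of the unregularized problem (no null-space component).
theta_min_norm = np.linalg.pinv(A) @ b

cs = [1.0, 10.0, 100.0, 1e4, 1e6]
errors = []
for c in cs:
    theta_c = ridge_solution(A, b, c)
    errors.append(np.linalg.norm(theta_c - theta_min_norm))
    print(f"c = {c:>9.0f}  theta = {theta_c}  distance = {errors[-1]:.2e}")
```

The regularized system stays solvable for every finite c even though `A` itself cannot be inverted, which mirrors the well-definedness property claimed for R-GTD.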


This review was generated by AI and reviewed by human editors.