QUICK REVIEW

[Paper Review] Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space

Thanh-Tuan Tran, Thanh Nguyen Canh|arXiv (Cornell University)|Mar 1, 2026

Reinforcement Learning in Robotics | Cited by 0

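One-Sentence Summary

The paper extends TD3 to discrete-continuous hybrid action spaces, analyzes overestimation bias, and proposes a weighted clipped Q-learning target that marginalizes over the discrete action distribution to improve the stability and performance of robotic manipulation under full domain randomization.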

ABSTRACT

Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain randomization. In this paper, we propose Hybrid TD3, an extension of Twin Delayed Deep Deterministic Policy Gradient (TD3) that natively handles parameterized hybrid action spaces in a principled manner. We conduct a rigorous theoretical analysis of overestimation bias in hybrid action settings, deriving formal bounds under twin-critic architectures and establishing a complete bias ordering across five algorithmic variants. Building on this analysis, we introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution, achieving equivalent bias reduction to standard clipped minimization while improving policy smoothness. Experimental results demonstrate that Hybrid TD3 achieves superior training stability and competitive performance against state-of-the-art hybrid action baselines

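Motivation and Objectives

Motivate and address the stability of discrete-continuous hybrid-action reinforcement learning for robotic manipulation.
Empirically compare standard deep RL baselines under full domain randomization to identify TD3 as the most stable backbone.
Derive a theoretical bias ordering across five hybrid algorithm variants and propose a bias-mitigating target for Hybrid TD3.
Demonstrate improved stability and competitive zero-shot generalization on four manipulation tasks.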

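Proposed Method

Formalize a parameterized hybrid action space with one discrete binary action and a six-degree-of-freedom continuous component.
Adopt a twin-critic TD3 backbone and extend it to evaluate the discrete and continuous action components jointly.
Introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution in the Bellman backup.
Provide a theoretical analysis that establishes a bias ordering among five hybrid algorithm variants and justifies the chosen method.
Describe the state representation, reward design, and training protocol for four UF850 robot manipulation tasks.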
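The weighted clipped target described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each target critic returns one Q-value per discrete action (evaluated at the target actor's continuous parameters), and that the target is formed as the reward plus the discounted, probability-weighted sum of the element-wise minimum of the two critics. The function names, argument names, and array shapes are hypothetical. A "hard" variant that backs up only the most probable discrete action is included for contrast.

```python
import numpy as np


def weighted_clipped_target(reward, done, q1_next, q2_next, disc_probs, gamma=0.99):
    """Assumed form of a weighted clipped double-Q target (illustrative only).

    reward, done : (batch,) arrays; done is 1.0 for terminal transitions.
    q1_next, q2_next : (batch, K) target-critic values at the next state,
        one column per discrete action.
    disc_probs : (batch, K) target policy's probabilities over discrete actions.
    """
    reward = np.asarray(reward, dtype=float)
    done = np.asarray(done, dtype=float)
    # Clipped double-Q: element-wise minimum of the twin critics.
    q_min = np.minimum(q1_next, q2_next)
    # Marginalize over the discrete action distribution.
    v_next = np.sum(np.asarray(disc_probs) * q_min, axis=1)
    return reward + gamma * (1.0 - done) * v_next


def hard_clipped_target(reward, done, q1_next, q2_next, disc_probs, gamma=0.99):
    """Contrast case: back up only the most probable discrete action."""
    reward = np.asarray(reward, dtype=float)
    done = np.asarray(done, dtype=float)
    q_min = np.minimum(q1_next, q2_next)
    k_star = np.argmax(disc_probs, axis=1)
    v_next = q_min[np.arange(q_min.shape[0]), k_star]
    return reward + gamma * (1.0 - done) * v_next


if __name__ == "__main__":
    # Hypothetical batch of two transitions with a binary discrete action (K = 2).
    r = np.array([1.0, 0.5])
    d = np.array([0.0, 1.0])
    q1 = np.array([[2.0, 4.0], [1.0, 3.0]])
    q2 = np.array([[3.0, 3.5], [0.5, 5.0]])
    p = np.array([[0.25, 0.75], [0.5, 0.5]])
    print(weighted_clipped_target(r, d, q1, q2, p, gamma=0.9))
    print(hard_clipped_target(r, d, q1, q2, p, gamma=0.9))
```

In this sketch the weighted variant changes continuously as the discrete probabilities shift, whereas the hard variant jumps whenever the most probable action switches. That is one plausible reading of the smoothness claim, offered here as an interpretation rather than the paper's stated argument.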

Figure 2 : Our proposed DRL system deviates from the traditional Markov Decision Process (MDP): it relies not only on the current trajectory to decide the future but also combines past trajectories to help the agent learn more smoothly. This model processes the environment observation $o_{t}$ that

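Experimental Results

Research Questions

RQ1: What is the impact of overestimation bias in hybrid (discrete-continuous) action reinforcement learning under full domain randomization?
RQ2: Which backbone DRL algorithm provides the most stable learning for hybrid actions, and why?
RQ3: Does a weighted target that marginalizes over the discrete distribution improve policy smoothness in Hybrid TD3 while preserving its bias properties?
RQ4: How do the proposed method and its estimation bias compare with existing hybrid baselines on manipulation tasks?
RQ5: Do policies learned under full domain randomization generalize zero-shot to unseen object categories?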

Key Findings

Object set	Action 0 (%)	Action 1 (%)	Action 2 (%)	Action 3 (%)
Standard set	94.25 ± 1.92	89.75 ± 4.66	80.75 ± 2.58	83.25 ± 3.56
Unseen set	94.25 ± 1.92	90.00 ± 5.15	81.75 ± 4.66	82.75 ± 2.58

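Under strong domain randomization, the TD3-based hybrid method shows higher stability and performance than the SAC, DDPG, and PPO baselines.
The weighted clipped Q-learning target marginalizes over the discrete distribution, yielding smoother gradients while retaining TD3-like bias properties.
The theoretical bias ordering of the five hybrid variants indicates that Hybrid TD3 has the most favorable (lowest) expected bias under dense rewards and strong randomization.
Hybrid TD3 achieves the highest final average return across the four manipulation tasks and shows zero-shot generalization to unseen objects.
The final policy is robust on both the standard and unseen object sets, with high success rates (Reach, Pick, Move, Put).
Zero-shot generalization to new object categories is observed, with only a small degradation between the standard and unseen object sets.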
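To make the bias discussion concrete, the following toy Monte Carlo experiment illustrates the general effect behind clipped double-Q learning. It is not a reproduction of the paper's analysis or results. It assumes two discrete actions with identical true values, independent Gaussian noise on each critic's estimate, and a fixed uniform discrete policy; all names and numbers are hypothetical. Under these assumptions, the maximum over a single noisy critic is biased upward, the minimum of two critics is biased downward, and the probability-weighted clipped estimate has roughly the same bias as the single-action clipped estimate but a smaller spread.

```python
import numpy as np


def toy_bias_experiment(n_trials=50_000, noise_std=1.0, seed=0):
    """Return {estimator: (bias, std)} for a toy two-action problem.

    Assumptions (hypothetical): both discrete actions have true value 1.0,
    critic errors are independent Gaussians, and the policy is uniform.
    """
    rng = np.random.default_rng(seed)
    true_q = np.array([1.0, 1.0])
    probs = np.array([0.5, 0.5])

    # Two independent noisy critics, one column per discrete action.
    q1 = true_q + rng.normal(0.0, noise_std, size=(n_trials, 2))
    q2 = true_q + rng.normal(0.0, noise_std, size=(n_trials, 2))
    q_min = np.minimum(q1, q2)

    estimates = {
        "single_critic_max": q1.max(axis=1),   # greedy max over one critic
        "clipped_one_action": q_min[:, 0],     # min of two critics, one action
        "weighted_clipped": q_min @ probs,     # marginalized clipped estimate
    }
    # All actions share the same true value, so it is also the true state value.
    true_value = float(true_q @ probs)
    return {
        name: (float(est.mean() - true_value), float(est.std()))
        for name, est in estimates.items()
    }


if __name__ == "__main__":
    for name, (bias, spread) in toy_bias_experiment().items():
        print(f"{name:20s} bias={bias:+.3f} std={spread:.3f}")
```

The equal-bias, lower-spread pattern in this toy setting follows from linearity of expectation and from averaging independent errors. Whether it carries over to learned critics with correlated errors is exactly what the paper's formal analysis addresses.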

Figure 4 : Estimation bias of the baselines (top row), estimation bias of the proposed methods (middle row), and average return (bottom row) across four manipulation tasks. Solid curves represent mean performance, while shaded areas indicate standard deviations over four independent random seeds.

This review was generated by AI and reviewed by a human editor.