QUICK REVIEW

[Paper Review] Manifold-Constrained Energy-Based Transition Models for Offline Reinforcement Learning

Zeyu Fang, Zuyuan Zhang|arXiv (Cornell University)|Feb 2, 2026

Model Reduction and Neural Networks | Cited by 0

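One-Sentence Summary

MC-ETM learns manifold-aware energy-based transition models and uses energy-guided truncation with pessimistic penalties to improve policy learning in offline reinforcement learning, particularly under distribution shift and discontinuous dynamics.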

ABSTRACT

Model-based offline reinforcement learning is brittle under distribution shift: policy improvement drives rollouts into state-action regions weakly supported by the dataset, where compounding model error yields severe value overestimation. We propose Manifold-Constrained Energy-based Transition Models (MC-ETM), which train conditional energy-based transition models using a manifold projection-diffusion negative sampler. MC-ETM learns a latent manifold of next states and generates near-manifold hard negatives by perturbing latent codes and running Langevin dynamics in latent space with the learned conditional energy, sharpening the energy landscape around the dataset support and improving sensitivity to subtle out-of-distribution deviations. For policy optimization, the learned energy provides a single reliability signal: rollouts are truncated when the minimum energy over sampled next states exceeds a threshold, and Bellman backups are stabilized via pessimistic penalties based on Q-value-level dispersion across energy-guided samples. We formalize MC-ETM through a hybrid pessimistic MDP formulation and derive a conservative performance bound separating in-support evaluation error from truncation risk. Empirically, MC-ETM improves multi-step dynamics fidelity and yields higher normalized returns on standard offline control benchmarks, particularly under irregular dynamics and sparse data coverage.

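Research Motivation and Objectives

Address the brittleness of model-based offline reinforcement learning under distribution shift, where rollouts reach regions weakly supported by the dataset.
Introduce geometry-aware energy-based transition learning so that the energy landscape aligns sharply with the boundary of the dataset support.
Use the learned energy E_theta(s,a,s') as a reliability signal for truncating rollouts, and apply pessimistic Q-value penalties.
Provide theoretical guarantees that separate in-support evaluation error from truncation risk.
Demonstrate improved dynamics fidelity and returns on standard offline benchmarks, especially under irregular dynamics and sparse data.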
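
The truncation rule built on this reliability signal can be illustrated with a minimal sketch. The energy function below is a hypothetical toy stand-in (squared distance to s + a), not the learned E_theta, and the names should_truncate, toy_energy, and the threshold value 0.5 are assumptions chosen for illustration only.

```python
import numpy as np

def toy_energy(s, a, s_next):
    """Hypothetical stand-in for E_theta(s, a, s'): squared distance to s + a."""
    return float(np.sum((s_next - (s + a)) ** 2))

def should_truncate(energy_fn, s, a, next_state_samples, delta):
    """Truncate when the minimum energy over sampled next states exceeds delta."""
    min_energy = min(energy_fn(s, a, s_next) for s_next in next_state_samples)
    return min_energy > delta, min_energy

s = np.zeros(2)
a = np.array([0.1, 0.0])
near = [s + a + 0.01, s + a - 0.01]   # candidates close to the toy low-energy region
far = [s + a + 2.0, s + a - 3.0]      # candidates far from it

stop_near, e_near = should_truncate(toy_energy, s, a, near, delta=0.5)
stop_far, e_far = should_truncate(toy_energy, s, a, far, delta=0.5)
print(stop_near, stop_far)
```

Taking the minimum means a rollout continues as long as at least one sampled next state looks plausible under the energy, and stops only when every candidate is judged unreliable.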

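Proposed Method

Introduces a manifold projection-diffusion (MPD) negative sampler that learns a latent manifold of next states.
Trains a conditional energy-based transition model, generating near-manifold hard negatives via latent perturbation and Langevin dynamics.
Uses the energy E_theta(s,a,s') as a reliability signal, truncating rollouts when min_s' E_theta(s,a,s') exceeds a threshold delta.
Stabilizes Bellman backups with pessimistic penalties based on Q-value dispersion across energy-guided samples.
Formalizes the method as a hybrid pessimistic MDP and gives a conservative performance bound that separates in-support error from truncation risk.
Provides a theoretical analysis of the energy-constrained operator and its performance bound.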
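
As a rough illustration of the negative-sampling step, the sketch below perturbs a latent code and then runs unadjusted Langevin dynamics in latent space against an energy evaluated on the decoded next state. Everything here is an assumption made for the example: the linear decoder, the toy quadratic energy, the finite-difference gradient, and all hyperparameter values. The paper's learned encoder, decoder, and energy network are not reproduced.

```python
import numpy as np

def numerical_grad(f, z, eps=1e-5):
    """Central-difference gradient of a scalar function f at z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad

def latent_langevin_negative(energy_fn, decode, s, a, z0, rng,
                             n_steps=20, step_size=0.01, perturb_std=0.1):
    """Perturb a latent code, then take Langevin steps on the latent energy."""
    z = z0 + perturb_std * rng.standard_normal(z0.shape)

    def latent_energy(z_):
        return energy_fn(s, a, decode(z_))

    for _ in range(n_steps):
        grad = numerical_grad(latent_energy, z)
        noise = rng.standard_normal(z.shape)
        # Langevin update: drift down the energy gradient plus Gaussian noise.
        z = z - 0.5 * step_size * grad + np.sqrt(step_size) * noise
    return decode(z)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # hypothetical linear decoder: 2-D latent -> 3-D state

def decode(z):
    return W @ z

def toy_energy(s, a, s_next):
    """Hypothetical energy: squared distance between s' and s + a."""
    return float(np.sum((s_next - (s + a)) ** 2))

s, a = np.zeros(3), np.full(3, 0.1)
z_data = np.array([0.3, -0.2])    # latent code of an observed next state (made up)
negative = latent_langevin_negative(toy_energy, decode, s, a, z_data, rng)
print(negative.shape)
```

Because the sampler moves only in latent space and decodes at the end, the resulting negatives stay on the image of the decoder, which is the sense in which they are near-manifold in this toy setting.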

Figure 1 : An illustrative example on fitting a discontinuous transition function.

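Experimental Results

Research Questions

RQ1: Can learning from manifold-based negative samples improve energy-based transition modeling for offline RL?
RQ2: Does MC-ETM improve dynamics fidelity and returns on offline benchmarks, particularly under irregular dynamics and sparse coverage?
RQ3: Do energy-based truncation and ensemble-based pessimism stabilize offline policy optimization?
RQ4: How does the hybrid pessimistic formulation bound the performance gap relative to the offline optimum?

Key Findings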

Task	CQL	TD3+BC	EDAC	MOPO	COMBO	RAMBO	MOBILE	EMPO*	ETM
halfcheetah-r	31.3	11.0	28.4	38.5	38.8	39.5	39.3	14.3	40.7 ± 1.1
hopper-r	5.3	8.5	25.3	31.7	17.9	25.4	31.9	30.9	31.8 ± 0.3
walker-r	5.4	1.6	16.6	7.4	7.0	0.0	17.9	13.7	19.6 ± 1.3
halfcheetah-m	46.9	48.3	65.9	73.0	54.2	77.9	74.6	21.2	76.9 ± 0.6
hopper-m	61.9	59.3	101.6	62.8	97.2	87.0	106.6	32.9	107.0 ± 1.1
walker-m	79.5	83.7	92.5	84.1	81.9	84.9	87.7	55.4	92.7 ± 0.7
halfcheetah-m-r	45.3	44.6	61.3	72.1	55.1	68.7	71.7	8.4	72.4 ± 1.5
hopper-m-r	86.3	60.9	101.0	103.5	89.5	99.5	103.9	34.9	104.8 ± 0.8
walker-m-r	76.8	81.8	87.1	85.6	56.0	89.2	89.9	66.1	90.2 ± 1.3
halfcheetah-m-e	95.0	90.7	106.3	90.8	90.0	95.4	108.2	28.1	105.2 ± 2.9
hopper-m-e	96.9	98.0	110.7	81.6	111.1	88.2	112.6	41.8	113.8 ± 0.9
walker-m-e	109.1	110.1	114.7	112.9	103.3	56.7	115.2	76.2	114.9 ± 1.8
Average	61.6	58.2	76.0	70.3	66.8	67.7	80.0	35.3	80.8
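
The Average row can be checked directly from the per-task scores. The short sketch below recomputes it for the MOBILE column and the final column of the table, using only the values listed above; the helper name average is an assumption for illustration.

```python
# Per-task normalized returns copied from the table, in row order.
mobile = [39.3, 31.9, 17.9, 74.6, 106.6, 87.7, 71.7, 103.9, 89.9, 108.2, 112.6, 115.2]
last_column = [40.7, 31.8, 19.6, 76.9, 107.0, 92.7, 72.4, 104.8, 90.2, 105.2, 113.8, 114.9]

def average(scores):
    """Mean score rounded to one decimal place, as reported in the table."""
    return round(sum(scores) / len(scores), 1)

print(average(mobile), average(last_column))
```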

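MC-ETM shows lower prediction error than MLP, diffusion-model, and standard ETM baselines across multiple environments, including in out-of-distribution regions.
On the D4RL MuJoCo benchmark, MC-ETM achieves the highest average normalized return across the random, medium, medium-replay, and medium-expert datasets (80.8 in the table above, versus 80.0 for MOBILE), although it is not the top scorer on every individual task.
Energy-based truncation effectively prevents exploration into high-energy (OOD) regions, improving training stability.
Ensembling Q-values with energy-guided penalties reduces value overestimation and makes policy updates more stable.
Manifold-constrained negative samples sharpen the energy landscape near the dataset support, which leads to better modeling of discontinuous dynamics.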
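
One simple way to picture a dispersion-based pessimistic backup is to subtract a multiple of the spread of Q-values from their mean before bootstrapping. The sketch below uses mean minus beta times standard deviation, which is an assumed form chosen for illustration; the source does not give the exact penalty, and the names pessimistic_target, gamma, and beta are hypothetical.

```python
import numpy as np

def pessimistic_target(reward, q_values, gamma=0.99, beta=1.0):
    """Bellman target penalized by Q-value dispersion (assumed form: mean - beta * std)."""
    q_values = np.asarray(q_values, dtype=float)
    return reward + gamma * (q_values.mean() - beta * q_values.std())

# Q-values at several sampled next states (made-up numbers).
agree = pessimistic_target(1.0, [10.0, 10.0, 10.0])      # no dispersion, no penalty
disagree = pessimistic_target(1.0, [5.0, 10.0, 15.0])    # same mean, larger spread
print(agree, disagree)
```

With identical means, the target shrinks as the sampled Q-values disagree more, so transitions whose outcomes the model is unsure about contribute less optimistic value estimates.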

Figure 2 : Conceptual visualization of energy landscapes on a 2D slice of the high-dimensional state space


This review was generated by AI and reviewed by a human editor.