QUICK REVIEW

[Paper Review] PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning

Tianmeng Hu, Biao Luo|arXiv (Cornell University)|Mar 20, 2026

Advanced Multi-Objective Optimization Algorithms|Cited by 0

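One-Sentence Summary

PA2D-MORL combines a decomposition based on the Pareto ascent direction with an evolutionary multi-policy MORL framework to approximate a high-quality Pareto frontier with improved stability on continuous control tasks. It outperforms the state-of-the-art method on multiple MuJoCo-based multi-objective tasks.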

ABSTRACT

Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.

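Motivation and Objectives

Advance multi-objective decision-making in deep reinforcement learning (DRL) for applications where objectives conflict (e.g., speed vs. energy consumption).
Propose a decomposition based on the Pareto ascent direction that optimizes multiple policies without preset preferences.
Develop an evolutionary multi-policy MORL framework to explore and cover the Pareto frontier.
Introduce Pareto adaptive fine-tuning (PA-FT) to densify and extend the Pareto frontier approximation.
Demonstrate state-of-the-art performance and stability on seven MuJoCo-based multi-objective tasks.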

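Proposed Method

Formulate MORL as maximizing the vector of objective returns J^π, and scalarize it with weights ω to obtain J(π;ω) = ω^T J^π.
Compute the aggregated policy gradient ∇_θ J(π_θ;ω) = ∑_i ω_i ∇_θ J_i^{π_θ} to steer optimization toward the Pareto frontier.
Solve α* = argmin_{α≥0, ∑_i α_i=1} ||∑_i α_i ∇_θ J_i^{π_θ}||^2 for the weights α*, which define the Pareto ascent direction ∑_i α*_i ∇_θ J_i^{π_θ} used as the optimization direction, with no prior preference over objectives.
Maintain a set of non-dominated policies and update policies through an evolutionary multi-generation loop.
Use partitioned greedy randomized (PGR) policy selection to update diverse policies across partitions of the objective space.
Apply Pareto adaptive fine-tuning (PA-FT) to densify and extend the frontier by locating large gaps and the objective-wise endpoints.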
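
The min-norm problem in the list above has a closed form when there are only two objectives, which makes the idea easy to check by hand. The following is a minimal sketch under that assumption, not the authors' implementation: `dot` and `pareto_ascent_direction` are hypothetical helper names, the gradients are plain lists of floats standing in for policy gradients, and a setting with more than two objectives would need a quadratic-programming solver instead.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pareto_ascent_direction(
    g1: Sequence[float], g2: Sequence[float]
) -> Tuple[Tuple[float, float], List[float]]:
    """Min-norm convex combination of two objective gradients.

    Minimises ||alpha * g1 + (1 - alpha) * g2||^2 over alpha in [0, 1]
    and returns the weights (alpha, 1 - alpha) with the combined direction.
    Illustrative helper only; the name is an assumption of this sketch.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: every convex combination is the same vector.
        alpha = 0.5
    else:
        # Unconstrained minimiser, then clipped back onto the simplex edge.
        alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
        alpha = min(1.0, max(0.0, alpha))
    direction = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return (alpha, 1.0 - alpha), direction


# Toy gradients (made-up numbers): the two objectives partially conflict.
grad_speed = [1.0, 0.0]
grad_energy = [-1.0, 1.0]
weights, direction = pareto_ascent_direction(grad_speed, grad_energy)
print(weights, direction)
print(dot(direction, grad_speed), dot(direction, grad_energy))
```

The returned direction has a non-negative inner product with each gradient, which is the sense in which a step along it improves both objectives to first order. For orthogonal gradients such as `[1, 0]` and `[0, 1]` the sketch returns equal weights.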

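Experimental Results

Research Questions

RQ1: Does the Pareto ascent direction provide a direction that improves all objectives simultaneously, without loss and without any objective preference?
RQ2: Does the evolutionary multi-policy framework with Pareto ascent gradients yield a higher-quality and more stable Pareto frontier than prediction-model-based MORL methods?
RQ3: Does PA-FT sufficiently densify and extend the Pareto frontier across environments?
RQ4: What advantages do the proposed decomposition and policy selection strategies have over state-of-the-art MORL baselines on continuous control tasks?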

Key Findings

Environment	HV PA2D-MORL	HV PA2D-ablated	HV PGMORL	HV PFA	HV MOEA/D	SP PA2D-MORL	SP PA2D-ablated	SP PGMORL	SP PFA	SP MOEA/D
Walker2d	5.743±0.121	5.320±0.186	4.849±0.558	4.329±0.553	4.612±0.545	0.014±0.006	0.180±0.096	0.021±0.018	0.309±0.225	0.710±0.285
Humanoid	51.23±2.66	42.93±4.14	44.75±5.81	40.55±5.02	46.35±7.33	0.133±0.031	0.274±0.177	0.255±0.121	0.715±0.516	2.871±1.342
HalfCheetah	5.787±0.020	5.741±0.053	5.782±0.018	5.765±0.081	5.739±0.075	0.026±0.013	0.106±0.035	0.022±0.015	0.548±0.209	0.679±0.295
Hopper-2	22.09±0.57	21.30±0.68	19.10±2.41	20.61±4.31	20.73±1.17	0.503±0.107	0.1 868±0.389	0.559±0.529	4.485±2.219	2.346±0.672
Ant	6.814±0.167	6.242±0.294	6.283±0.277	6.209±0.464	6.233±0.477	0.209±0.019	0.351±0.047	0.832±0.457	1.021±0.554	1.696±0.581
Swimmer	3.187±0.056	2.965±0.336	2.566±0.595	2.392±0.467	2.323±0.531	0.550±0.207	0.603±0.241	0.917±0.862	1.976±0.582	2.601±1.094
Hopper-3	3.889±0.191	3.759±0.277	3.766±0.254	-	3.681±0.434	0.021±0.013	0.106±0.052	0.032±0.011	-	0.642±0.215

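PA2D-MORL achieves the best hypervolume (HV) in all seven MuJoCo environments compared with the baselines.
PA2D-MORL attains the densest Pareto frontier (lowest sparsity, SP) in most environments; one exception is HalfCheetah, where PGMORL is slightly lower (0.022 vs. 0.026).
PA2D-MORL is also more stable across runs, with lower standard deviations in HV and SP than the baselines in most environments.
Removing PA-FT lowers frontier density, highlighting PA-FT's role in densifying the Pareto approximation.
PA2D-MORL outperforms PGMORL and MOEA/D variants in many settings, especially on Humanoid and Walker2d, which points to the advantage of Pareto ascent directional decomposition over prediction-based or traditional evolutionary methods.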
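
The HV and SP columns can be made concrete for the two-objective case. The sketch below assumes the common definitions: hypervolume as the area dominated by the non-dominated set relative to a reference point, and sparsity as the mean squared gap between neighboring front points, summed over objectives. The summary does not state the reference point or any scaling behind the table, so the function names, the `ref` default, and the toy points are illustrative assumptions, and the output is not comparable to the reported numbers.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and better somewhere (maximisation)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points: Sequence[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    # Sorted by the first objective descending, the front's second
    # objective is non-decreasing, so the area is a sum of horizontal strips.
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        if x > ref[0] and y > prev_y:
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area


def sparsity(points: Sequence[Point]) -> float:
    """Mean squared gap between neighbouring front points, summed over objectives."""
    front = pareto_front(points)
    if len(front) < 2:
        return 0.0
    total = 0.0
    for j in range(len(front[0])):
        values = sorted(p[j] for p in front)
        total += sum((b - a) ** 2 for a, b in zip(values, values[1:]))
    return total / (len(front) - 1)


# Toy returns for four policies (made-up numbers); (1, 1) is dominated.
returns = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
print(pareto_front(returns))
print(hypervolume_2d(returns), sparsity(returns))
```

A larger HV means the front covers more of the objective space, while a smaller SP means its points are more evenly and densely spread, which is why the two metrics are read together.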

This review was generated by AI and reviewed by a human editor.