QUICK REVIEW

[Paper Digest] VisFly-Lab: Unified Differentiable Framework for First-Order Reinforcement Learning of Quadrotor Control

Fanxing Li, Fangyu Sun|arXiv (Cornell University)|Mar 22, 2026

Reinforcement Learning in Robotics | Cited by: 0

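One-Sentence Summary

The paper proposes a unified, wrapped differentiable framework for multi-task quadrotor control and introduces Amended Backpropagation Through Time (ABPT) to address two problems in first-order reinforcement learning: limited state coverage caused by horizon initialization, and gradient bias caused by partially non-differentiable rewards. Across hovering, tracking, landing, and racing, ABPT shows its clearest gains on tasks with partially non-differentiable rewards while remaining competitive on fully differentiable ones, and proof-of-concept deployments indicate initial real-world transferability.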

ABSTRACT

First-order reinforcement learning with differentiable simulation is promising for quadrotor control, but practical progress remains fragmented across task-specific settings. To support more systematic development and evaluation, we present a unified differentiable framework for multi-task quadrotor control. The framework is wrapped, extensible, and equipped with deployment-oriented dynamics, providing a common interface across four representative tasks: hovering, tracking, landing, and racing. We also present the suite of first-order learning algorithms, where we identify two practical bottlenecks of standard first-order training: limited state coverage caused by horizon initialization and gradient bias caused by partially non-differentiable rewards. To address these issues, we propose Amended Backpropagation Through Time (ABPT), which combines differentiable rollout optimization, a value-based auxiliary objective, and visited-state initialization to improve training robustness. Experimental results show that ABPT yields the clearest gains in tasks with partially non-differentiable rewards, while remaining competitive in fully differentiable settings. We further provide proof-of-concept real-world deployments showing initial transferability of policies learned in the proposed framework beyond simulation.

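Motivation and Objectives

Provide a unified, extensible differentiable framework for multi-task quadrotor control with a common interface across four tasks (hovering, tracking, landing, racing).
Develop and evaluate first-order RL methods within this framework, addressing practical bottlenecks in differentiable training.
Propose Amended Backpropagation Through Time (ABPT) to mitigate the state-coverage limits of horizon initialization and the gradient bias introduced by non-differentiable rewards.
Demonstrate ABPT's empirical gains over the baselines and show initial sim-to-real transferability.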

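Proposed Method

Wrap and extend a differentiable simulation covering four tasks (hovering, tracking, landing, racing) with deployment-oriented dynamics.
Formulate first-order gradient training with ABPT as an on-policy actor-critic method, evaluated alongside BPTT, SHAC, and PPO baselines.
Introduce ABPT, which combines 0-step and N-step returns to reduce the gradient bias caused by partially non-differentiable rewards and to improve robustness.
Use a visited-state replay buffer to initialize horizons from previously seen states, improving state-space coverage.
Adopt a high-fidelity 6-DoF quadrotor model with CTBR control and actuator dynamics, built on a differentiable physics engine implemented in PyTorch.
Evaluate on the four tasks and compare against the baselines in sample efficiency and final performance.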
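
The two ABPT ingredients listed above, mixing a 0-step return with an N-step return and restarting horizons from visited states, can be illustrated with a minimal sketch. This is not the authors' implementation: the function and class names, the interpretation of the 0-step return as the critic's estimate at the start of the horizon, the equal mixing weight, and the ring-buffer eviction policy are all assumptions made for illustration. Gradients are omitted; the code only shows how the quantities are assembled.

```python
import random


def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of N rewards plus a discounted critic value at step N."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total + (gamma ** len(rewards)) * bootstrap_value


def mixed_objective(value_at_start, rewards, bootstrap_value,
                    gamma=0.99, weight=0.5):
    """Blend a 0-step return with an N-step return.

    Assumption: the 0-step return is the critic's estimate at the start of
    the horizon, and `weight` (hypothetical) sets how much it contributes.
    """
    zero_step = value_at_start
    n_step = n_step_return(rewards, bootstrap_value, gamma)
    return weight * zero_step + (1.0 - weight) * n_step


class VisitedStateBuffer:
    """Keeps recently visited states so new horizons can start from them."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.states = []
        self.cursor = 0  # index of the oldest entry once the buffer is full
        self.rng = random.Random(seed)

    def add(self, state):
        if len(self.states) < self.capacity:
            self.states.append(state)
        else:
            self.states[self.cursor] = state  # overwrite the oldest entry
            self.cursor = (self.cursor + 1) % self.capacity

    def sample_initial_state(self, default_reset):
        """Return a stored state, or fall back to the task's own reset."""
        if not self.states:
            return default_reset()
        return self.rng.choice(self.states)
```

In a training loop following this sketch, each rollout would call `add` on the states it passes through, and the next horizon would begin from `sample_initial_state` instead of always from the task's default reset distribution.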

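Experimental Results

Research Questions

RQ1: Can a unified differentiable framework support multiple quadrotor control tasks under deployment-oriented dynamics?
RQ2: Do first-order RL methods benefit from a unified interface when trained on hovering, tracking, landing, and racing?
RQ3: Can ABPT mitigate horizon-induced limits on state coverage and the gradient bias caused by non-differentiable rewards?
RQ4: How much does ABPT improve performance and robustness over PPO, BPTT, and SHAC across tasks?
RQ5: Do policies learned in the framework show initial transferability to real-world quadrotors?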

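Key Findings

ABPT shows its clearest gains on tasks with partially non-differentiable rewards (such as landing and racing).
ABPT is competitive within the unified benchmark and typically converges faster than the baselines on the first three tasks.
PPO is stable but less sample-efficient because it lacks analytic gradients.
BPTT suffers from gradient bias and poor sample efficiency under non-differentiable rewards, especially in racing.
SHAC exhibits high critic variance and, on some tasks, underperforms ABPT because of non-differentiable reward components.
Proof-of-concept real-world deployments show initial transferability of policies learned in the framework.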
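
The gradient bias attributed to BPTT above can be seen in a one-dimensional toy problem. The sketch below is not taken from the paper: the reward shape, the constants `SIGMA`, `BONUS`, and `THRESHOLD`, and all function names are assumptions chosen so that the expectation has a closed form. The action is `a = theta + SIGMA * eps` with standard normal `eps`, and the reward is a smooth term `-(a - 1)^2` plus a discrete bonus when `a > THRESHOLD`. Differentiating through the rollout sees only the smooth term, because the indicator has zero derivative almost everywhere.

```python
import math

SIGMA = 0.5      # assumed action-noise standard deviation
BONUS = 5.0      # assumed size of the discrete success bonus
THRESHOLD = 2.0  # assumed success threshold on the action


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_return(theta):
    """E[-(a - 1)^2] + BONUS * P(a > THRESHOLD) for a ~ N(theta, SIGMA^2)."""
    smooth = -((theta - 1.0) ** 2 + SIGMA ** 2)
    bonus = BONUS * (1.0 - normal_cdf((THRESHOLD - theta) / SIGMA))
    return smooth + bonus


def first_order_gradient(theta):
    """Expected pathwise gradient: only the smooth reward term contributes."""
    return 2.0 * (1.0 - theta)


def true_gradient(theta):
    """Exact derivative of expected_return, including the bonus term."""
    bonus_term = BONUS * normal_pdf((THRESHOLD - theta) / SIGMA) / SIGMA
    return 2.0 * (1.0 - theta) + bonus_term
```

At `theta = 1.0` the first-order gradient vanishes while the true gradient is still positive, so a purely pathwise method would stop at a point that the full objective would move away from. A critic trained on the complete reward is one way to recover the missing signal, which is the role the digest ascribes to ABPT's value-based objective.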


This digest was generated by AI and reviewed by human editors.