QUICK REVIEW

[Paper Review] RAPT: Model-Predictive Out-of-Distribution Detection and Failure Diagnosis for Sim-to-Real Humanoid Robots

Humphrey Munn, Brendan Tidd|arXiv (Cornell University)|Feb 2, 2026

Robotic Locomotion and Control | Cited by 0

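One-Sentence Summary

RAPT is a lightweight deployment-time monitor for 50 Hz humanoid control that detects out-of-distribution execution and provides interpretable post-hoc failure diagnosis for sim-to-real transfer.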

ABSTRACT

Deploying learned control policies on humanoid robots is challenging: policies that appear robust in simulation can execute confidently in out-of-distribution (OOD) states after Sim-to-Real transfer, leading to silent failures that risk hardware damage. Although anomaly detection can mitigate these failures, prior methods are often incompatible with high-rate control, poorly calibrated at the extremely low false-positive rates required for practical deployment, or operate as black boxes that provide a binary stop signal without explaining why the robot drifted from nominal behavior. We present RAPT, a lightweight, self-supervised deployment-time monitor for 50Hz humanoid control. RAPT learns a probabilistic spatio-temporal manifold of nominal execution from simulation and evaluates execution-time predictive deviation as a calibrated, per-dimension signal. This yields (i) reliable online OOD detection under strict false-positive constraints and (ii) a continuous, interpretable measure of Sim-to-Real mismatch that can be tracked over time to quantify how far deployment has drifted from training. Beyond detection, we introduce an automated post-hoc root-cause analysis pipeline that combines gradient-based temporal saliency derived from RAPT's reconstruction objective with LLM-based reasoning conditioned on saliency and joint kinematics to produce semantic failure diagnoses in a zero-shot setting. We evaluate RAPT on a Unitree G1 humanoid across four complex tasks in simulation and on physical hardware. In large-scale simulation, RAPT improves True Positive Rate (TPR) by 37% over the strongest baseline at a fixed episode-level false positive rate of 0.5%. On real-world deployments, RAPT achieves a 12.5% TPR improvement and provides actionable interpretability, reaching 75% root-cause classification accuracy across 16 real-world failures using only proprioceptive data.

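Motivation and Objectives

Advance the reliable deployment of learned policies on humanoid robots by addressing silent, high-confidence out-of-distribution failures that arise after sim-to-real transfer.
Develop a lightweight online detector that runs at 50 Hz and produces calibrated per-dimension anomaly signals.
Provide interpretable failure diagnosis through gradient-based saliency analysis and an LLM-based semantic classifier conditioned on saliency and joint kinematics.
Enable post-hoc root-cause analysis that diagnoses Sim-to-Real mismatch from proprioceptive data, with occasional visual cues.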

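Proposed Method

Train a probabilistic, reconstruction-based detector (RAPT) on nominal simulation data to model the spatio-temporal manifold of valid humanoid behavior.
Use a GRU-based latent bridge, a residual encoder, and a probabilistic decoder to produce a per-dimension, uncertainty-aware reconstruction score (negative log-likelihood, NLL).
Calibrate anomaly thresholds in a Sim-to-Real calibration stage, and combine per-dimension and global gates in a bounding-box detector.
Run online detection at roughly 50 Hz with a three-gate system (per-dimension maximum, global mean, and range check) for robust safety.
Compute temporal saliency with Integrated Gradients, backpropagating the reconstruction NLL through time, to attribute failures across time steps and sensors.
Use a multimodal LLM to turn structured saliency and kinematic data into zero-shot semantic root-cause diagnoses.
Provide safety responses based on operator-defined policies (e.g., safe stop, controlled fall) rather than modifying the self-stabilizing controller.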
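
The per-dimension NLL score and the three-gate check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the decoder is assumed to output a diagonal Gaussian (mean and log-variance per dimension), the three gates are assumed to be combined with a logical OR, and all function names, thresholds, and sample values are hypothetical.

```python
import numpy as np

def gaussian_nll_per_dim(x, mu, log_var):
    """Per-dimension Gaussian negative log-likelihood (diagonal covariance)."""
    return 0.5 * (np.log(2.0 * np.pi) + log_var + (x - mu) ** 2 / np.exp(log_var))

def three_gate_alarm(nll, x, dim_thresh, global_thresh, lo, hi):
    """Evaluate the three gates for one control step.

    nll:           (D,) per-dimension NLL of the current observation
    x:             (D,) raw observation
    dim_thresh:    (D,) per-dimension thresholds (assumed calibrated offline)
    global_thresh: scalar threshold on the mean NLL
    lo, hi:        (D,) allowed observation range
    """
    gates = {
        # Gate 1: the worst dimension exceeds its own threshold.
        "per_dim_max": bool(np.max(nll - dim_thresh) > 0.0),
        # Gate 2: the average NLL over all dimensions is too high.
        "global_mean": bool(np.mean(nll) > global_thresh),
        # Gate 3: any raw value leaves its allowed range.
        "range": bool(np.any((x < lo) | (x > hi))),
    }
    # Assumption: an alarm is raised if any single gate fires.
    return any(gates.values()), gates

# Hypothetical 4-dimensional example with a unit-variance prediction at zero.
D = 4
mu, log_var = np.zeros(D), np.zeros(D)
dim_thresh, global_thresh = np.full(D, 5.0), 3.0
lo, hi = np.full(D, -10.0), np.full(D, 10.0)

x_nominal = np.array([0.1, -0.2, 0.0, 0.3])
x_fault = np.array([0.1, -0.2, 6.0, 0.3])  # dimension 2 deviates strongly

alarm_nominal, gates_nominal = three_gate_alarm(
    gaussian_nll_per_dim(x_nominal, mu, log_var), x_nominal,
    dim_thresh, global_thresh, lo, hi)
alarm_fault, gates_fault = three_gate_alarm(
    gaussian_nll_per_dim(x_fault, mu, log_var), x_fault,
    dim_thresh, global_thresh, lo, hi)

print(alarm_nominal, gates_nominal)
print(alarm_fault, gates_fault)
```

In this toy case the faulty sample stays inside the allowed range, so only the two NLL-based gates fire, which is the kind of deviation a plain range check would miss.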

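Experimental Results

Research Questions

RQ1: Does RAPT outperform state-of-the-art baselines at detecting OOD events in simulated and real-world humanoid tasks?
RQ2: Under the sim-to-real gap, does RAPT generalize from simulation to real hardware while maintaining a low false-positive rate?
RQ3: How effective is gradient-based saliency analysis combined with LLM reasoning at diagnosing root causes in real deployments?
RQ4: How much does each architectural component (temporal recurrence, probabilistic decoding, calibration, saliency, and multimodal diagnosis) contribute to detection performance?

Key Findings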

Method	Latency	Avg AUROC	Throwing	Velocity	Mimic (Dance)	Mimic (Gangnam)	Model Only	Hybrid
Isolation Forest	4.32 ms	0.69	0.24 ±0.04	0.34 ±0.03	0.42 ±0.01	0.38 ±0.00	0.18	0.34
PatchAD	11.45 ms	0.73	0.14 ±0.01	0.16 ±0.03	0.17 ±0.04	0.16 ±0.03	0.18	0.16
Deep SVDD	0.45 ms	0.67	0.29 ±0.03	0.31 ±0.10	0.41 ±0.01	0.37 ±0.01	0.14	0.34
LSTM-VAE	1.77 ms	0.77	0.30 ±0.02	0.36 ±0.02	0.44 ±0.01	0.42 ±0.01	0.32	0.38
Ours (RAPT)	1.63 ms	0.92	0.72 ±0.05	0.74 ∗ ±0.02	0.67 ±0.08	0.75 ±0.02	0.75	0.72
Ours (RAPT) Hybrid	1.63 ms	0.92	0.72 ±0.05	0.74 ∗ ±0.02	0.67 ±0.08	0.75 ±0.02	0.75	0.72
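
The fixed-FPR operating point named in the abstract (TPR at a 0.5% episode-level false-positive rate) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each episode is reduced to one score by taking the maximum per-step anomaly score, and the function names and the synthetic Gaussian scores are hypothetical.

```python
import numpy as np

def episode_score(step_scores):
    """Assumption: an episode is flagged by its single worst step."""
    return float(np.max(step_scores))

def calibrate_threshold(nominal_episode_scores, target_fpr=0.005):
    """Smallest threshold that lets at most target_fpr of nominal episodes alarm."""
    scores = np.sort(np.asarray(nominal_episode_scores, dtype=float))
    n = scores.size
    # Number of nominal episodes that are allowed to raise a false alarm.
    allowed = int(np.floor(target_fpr * n + 1e-9))
    # Alarms fire on score > threshold, so use the (allowed + 1)-th largest score.
    return scores[n - allowed - 1]

def rate_above(episode_scores, threshold):
    """Fraction of episodes whose score exceeds the threshold."""
    return float(np.mean(np.asarray(episode_scores, dtype=float) > threshold))

# Synthetic data: 2000 nominal and 500 anomalous episodes of 100 steps each.
rng = np.random.default_rng(0)
nominal = [episode_score(rng.normal(0.0, 1.0, size=100)) for _ in range(2000)]
anomalous = [episode_score(rng.normal(1.5, 1.0, size=100)) for _ in range(500)]

threshold = calibrate_threshold(nominal, target_fpr=0.005)
fpr = rate_above(nominal, threshold)
tpr = rate_above(anomalous, threshold)
print(f"threshold={threshold:.3f}  FPR={fpr:.4f}  TPR={tpr:.3f}")
```

Calibrating on episodes rather than individual steps matters at 50 Hz: a per-step false-positive rate that looks small can still stop most episodes once it is compounded over thousands of steps.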

In simulation, RAPT achieves the highest safety score (TPR @ 0.5% FPR) and AUROC across all tasks while keeping latency low (1.63 ms).
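At the same fixed FPR, RAPT improves TPR by an absolute +0.34 over the strongest baseline (LSTM-VAE): averaged over the four simulated tasks in the table, 0.72 versus 0.38. The corresponding gap in Avg AUROC is +0.15 (0.92 versus 0.77).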
On real hardware, hybrid RAPT (RAPT plus a range detector) detects 18 of 24 anomalous runs (75% recall), outperforming the baselines on recall.
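RAPT's proprioceptive saliency-based diagnosis improves top-1 and top-3 root-cause classification accuracy, especially when supplemented with visual frames.
The diagnosis pipeline can identify silent sim-to-real discrepancies (such as PD gains) and supports deployment validation beyond simple range checks.
The model supports zero-shot semantic failure classification through an LLM conditioned on saliency and kinematics.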
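
The temporal saliency behind these diagnoses relies on Integrated Gradients, which can be sketched on a toy problem as follows. This is an illustration only: the score function is a hypothetical stand-in for RAPT's reconstruction NLL (a persistence model that predicts each step from the previous one), gradients are taken by central finite differences rather than backpropagation through time, and the all-zero baseline window is an assumption.

```python
import numpy as np

def toy_window_score(window):
    """Hypothetical stand-in for the reconstruction NLL of a (T, D) window.

    Each step is 'predicted' by the previous step with unit variance, so the
    score is half the sum of squared step-to-step changes.
    """
    return 0.5 * float(np.sum((window[1:] - window[:-1]) ** 2))

def numerical_grad(f, x, eps=1e-4):
    """Central finite-difference gradient of scalar f with respect to array x."""
    grad = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x, dtype=float)
        step[idx] = eps
        grad[idx] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

def integrated_gradients(f, x, baseline, steps=32):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x, dtype=float)
    for a in alphas:
        # Gradient at a point on the straight path from baseline to x.
        avg_grad += numerical_grad(f, baseline + a * (x - baseline))
    avg_grad /= steps
    return (x - baseline) * avg_grad

# Toy window: 20 time steps, 3 sensors, with a jump in sensor 1 at step 12.
T, D = 20, 3
window = np.zeros((T, D))
window[12:, 1] = 2.0
baseline = np.zeros((T, D))  # assumed nominal reference window

attr = integrated_gradients(toy_window_score, window, baseline)
t_star, d_star = (int(i) for i in np.unravel_index(np.argmax(np.abs(attr)), attr.shape))
print("most salient (time step, sensor):", (t_star, d_star))
print("sum of attributions:", attr.sum())
print("score difference:   ", toy_window_score(window) - toy_window_score(baseline))
```

The attributions sum to the difference between the window's score and the baseline's score (the completeness property of Integrated Gradients), which is what makes the resulting time-by-sensor map usable as structured input for a downstream diagnosis step.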

This review was generated by AI and reviewed by human editors.