QUICK REVIEW

[Paper Review] RynnBrain: Open Embodied Foundation Models

Ronghao Dang, Jiayan Guo|arXiv (Cornell University)|Feb 13, 2026

Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

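RynnBrain is an open-source family of embodied foundation models (2B, 8B, and 30B-A3B MoE) with four core capabilities and four post-trained variants, and it performs strongly on 20 embodied benchmarks and 8 general vision understanding benchmarks. It also introduces physically grounded chained reasoning and a dedicated data pipeline, aiming at scalable, physics-aware embodied intelligence.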

ABSTRACT

Despite rapid progress in multimodal foundation models, the embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatial-temporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (i.e., RynnBrain-CoP). In extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, our RynnBrain foundation models outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key potentials of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.

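Research Motivation and Goals

Develop a unified spatiotemporal foundation model that is explicitly grounded in the physical environment and supports perception, reasoning, and planning for embodied tasks.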

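Proposed Method

A decoder-only vision-language architecture based on Qwen3-VL variants, with a vision encoder, a vision-language projector, and an LLM backbone.
Two dense model sizes (2B and 8B) plus one 30B-A3B MoE model, to fit different compute budgets.
A unified spatiotemporal representation that converts video frames into visual tokens with temporal embeddings.
An output space of discrete coordinate tokens for localizing bounding boxes, points, and trajectories.
Physics-aware pretraining with spatiotemporal memory and physical grounding, together with a data pipeline that draws on pretrained priors and human supervision.
Post-trained variants (RynnBrain-CoP, RynnBrain-Nav, RynnBrain-Plan, RynnBrain-VLA) for specialized embodied tasks.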
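
To make the discrete coordinate output space more concrete, here is a minimal Python sketch of how boxes, points, and trajectories might be serialized as coordinate tokens. It is an illustration only, not RynnBrain's actual tokenizer: the bin count (NUM_BINS = 1000), the `<coord_N>` token naming, and every function name are assumptions made for this example, since the review does not describe the real vocabulary.

```python
from typing import List, Sequence, Tuple

# Assumption for illustration: each axis is quantized into 1000 bins.
NUM_BINS = 1000


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    if size <= 0:
        raise ValueError("size must be positive")
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image bounds
    return min(int(ratio * num_bins), num_bins - 1)


def dequantize(bin_index: int, size: float, num_bins: int = NUM_BINS) -> float:
    """Return the pixel coordinate at the center of a bin."""
    return (bin_index + 0.5) / num_bins * size


def point_to_tokens(x: float, y: float, width: float, height: float) -> List[str]:
    """Serialize one point as two hypothetical coordinate tokens (x, then y)."""
    return [f"<coord_{quantize(x, width)}>", f"<coord_{quantize(y, height)}>"]


def box_to_tokens(box: Tuple[float, float, float, float],
                  width: float, height: float) -> List[str]:
    """Serialize a box (x1, y1, x2, y2) as its top-left and bottom-right points."""
    x1, y1, x2, y2 = box
    return (point_to_tokens(x1, y1, width, height)
            + point_to_tokens(x2, y2, width, height))


def trajectory_to_tokens(points: Sequence[Tuple[float, float]],
                         width: float, height: float) -> List[str]:
    """Serialize a trajectory as a flat sequence of point tokens."""
    tokens: List[str] = []
    for x, y in points:
        tokens.extend(point_to_tokens(x, y, width, height))
    return tokens
```

Under these assumptions, a point at the center of a 640x480 frame becomes two tokens, and decoding a token back to pixels loses at most one bin width of precision. One reason a design like this is attractive is that boxes, points, and trajectories all share the same small vocabulary, so the language model can emit them with its ordinary next-token head.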

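Experimental Results

Research Questions

RQ1: How can perception, reasoning, and planning be integrated into a single physically grounded foundation model for embodied tasks?
RQ2: Does a unified spatiotemporal model improve robustness across environments and tasks, and how do the post-trained variants extend its capabilities?
RQ3: Which data, training strategies, and evaluation benchmarks best reveal embodied capabilities such as egocentric understanding, localization, and planning?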

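Key Findings

RynnBrain significantly outperforms existing embodied foundation models on 20 embodied benchmarks and 8 general vision benchmarks.
RynnBrain-CoP improves complex spatiotemporal reasoning by about 7% on trajectory prediction benchmarks.
RynnBrain-Nav achieves state-of-the-art results on the R2R and RxR benchmarks across model scales.
RynnBrain-VLA shows robust manipulation planning and VLA execution, with grounded outputs.
The full dataset and benchmark (more than 20 million samples; RynnBrain-Bench) support scalable, reproducible development of embodied intelligence.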

This review was generated by AI and reviewed by human editors.