QUICK REVIEW

[Paper Review] RynnBrain: Open Embodied Foundation Models

Ronghao Dang, Jiayan Guo|arXiv (Cornell University)|Feb 13, 2026

Multimodal Machine Learning Applications|Citations: 0

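One-line summary

RynnBrain is an open-source family of embodied foundation models in three sizes (2B, 8B, and a 30B-A3B MoE), built around four core capabilities and accompanied by four post-trained variants. It performs strongly on 20 embodied benchmarks and 8 general vision understanding benchmarks, and introduces physically grounded chain-of-point reasoning along with a scalable data pipeline.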

ABSTRACT

Despite rapid progress in multimodal foundation models, embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatial-temporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (i.e., RynnBrain-CoP). In terms of extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, our RynnBrain foundation models largely outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key potentials of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.

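Motivation and objectives

Develop a unified spatiotemporal foundation model that is explicitly grounded in physical environments and supports perception, reasoning, and planning for embodied tasks.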

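Proposed method

Adopts a decoder-only vision-language architecture from the Qwen3-VL family, comprising a vision encoder, a vision-language projector, and an LLM backbone.
Offers two dense model sizes (2B and 8B) and a 30B-A3B MoE model to suit different compute budgets.
Uses a unified spatiotemporal representation that converts video frames into temporally embedded visual tokens.
Defines a physically grounded output space that uses discrete coordinate tokens for bounding boxes, points, and trajectories.
Applies physics-aware pretraining with spatiotemporal memory and physical grounding, supported by a data pipeline that leverages pretrained priors and human supervision.
Provides post-trained variants for specialized embodied tasks (RynnBrain-CoP, RynnBrain-Nav, RynnBrain-Plan, RynnBrain-VLA).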
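
The discrete coordinate output space listed above can be illustrated with a minimal sketch. It assumes pixel coordinates are normalized into 1000 integer bins and serialized as plain text. The bin count, the text format, and the function names `quantize_point`, `dequantize_point`, and `format_trajectory` are illustrative assumptions, not details confirmed by the paper or by this review.

```python
from typing import Iterable, Tuple

# Assumption: coordinates are normalized to integer bins 0..999.
# The actual vocabulary used by RynnBrain is not stated in this review.
NUM_BINS = 1000


def quantize_point(x: float, y: float, width: int, height: int,
                   num_bins: int = NUM_BINS) -> Tuple[int, int]:
    """Map a pixel coordinate to a pair of integer bins, clamped to range."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    qx = min(num_bins - 1, max(0, int(x / width * num_bins)))
    qy = min(num_bins - 1, max(0, int(y / height * num_bins)))
    return qx, qy


def dequantize_point(qx: int, qy: int, width: int, height: int,
                     num_bins: int = NUM_BINS) -> Tuple[float, float]:
    """Return the pixel coordinate at the center of the given bin."""
    return ((qx + 0.5) / num_bins * width, (qy + 0.5) / num_bins * height)


def format_trajectory(points: Iterable[Tuple[float, float]],
                      width: int, height: int) -> str:
    """Serialize a trajectory as text that a language model could emit."""
    bins = (quantize_point(x, y, width, height) for x, y in points)
    return " ".join(f"({qx}, {qy})" for qx, qy in bins)


if __name__ == "__main__":
    # A two-point trajectory in a 640x480 frame.
    print(format_trajectory([(0, 0), (320, 240)], 640, 480))
```

Normalizing to a fixed number of bins makes the output independent of image resolution, so bounding boxes (two points), single points, and trajectories (point sequences) can share one small coordinate vocabulary. The price is a quantization error of at most half a bin per axis.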

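Experimental results

Research questions

RQ1: How can perception, reasoning, and planning be integrated into a single physically grounded foundation model for embodied tasks?
RQ2: Does a unified spatiotemporal model improve robustness across diverse environments and tasks, and how do the post-trained variants extend its capabilities?
RQ3: Which data, training strategies, and evaluation benchmarks best demonstrate embodied ability in egocentric cognition, localization, and planning?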

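Key findings

RynnBrain substantially outperforms existing embodied foundation models on 20 embodied benchmarks and 8 general vision benchmarks.
RynnBrain-CoP improves complex spatiotemporal reasoning by about 7% on trajectory prediction benchmarks.
RynnBrain-Nav achieves state-of-the-art results on the R2R and RxR benchmarks regardless of model scale.
RynnBrain-VLA demonstrates robust manipulation planning and VLA execution with grounded outputs.
The complete data and benchmark (over 20 million samples; RynnBrain-Bench) support scalable, reproducible development of embodied intelligence.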

This review was generated by AI and checked by a human editor.