QUICK REVIEW

[Paper Review] AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

Y. Takagi, Motonari Kambara|arXiv (Cornell University)|Mar 16, 2026

Multimodal Machine Learning Applications | Citations: 0

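One-line summary

AnoleVLA is a lightweight Vision-Language-Action model built on a deep state space model (Mamba) backbone that efficiently generates continuous action trajectories for language-guided mobile manipulation. In real-world tests it achieves a higher success rate than the baselines and runs roughly three times faster than the large-scale π0.5.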

ABSTRACT

In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their deployment in resource-constrained environments remains challenging because of the computational cost of standard transformer backbones. To overcome this limitation, we propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently. The model leverages its lightweight and fast sequential state modeling to process visual and textual inputs, which allows the robot to generate trajectories efficiently. We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA by 21 points for the task success rate while achieving an inference speed approximately three times faster.

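Motivation and objectives

Address language-guided robotic manipulation under resource constraints.
Reduce inference latency and VRAM usage while maintaining task-level generality.
Enable long-context multimodal processing through linear-time sequence modeling.
Generate continuous action trajectories directly, without discretization.
Improve trajectory smoothness through a two-stage training strategy.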
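
The "linear-time sequence modeling" objective can be illustrated with a toy diagonal linear state space recurrence. This is only a sketch under stated assumptions: it is not Mamba, which adds input-dependent (selective) parameters and an optimized scan, and the function name, shapes, and parameter values are hypothetical. The point is that each output needs one fixed-size state update, so the cost grows linearly with sequence length T.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Toy diagonal linear SSM (illustrative, not Mamba).

    h_t = a * h_{t-1} + b * x_t,   y_t = c . h_t
    x: (T,) scalar inputs; a, b, c: (N,) per-state parameters.
    A single pass over the T steps gives O(T * N) work.
    """
    h = np.zeros_like(a, dtype=float)
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t   # constant-size state update per token
        y[t] = float(c @ h)   # readout from the current state
    return y

# Impulse input with a single state that decays by 0.5 each step.
x = np.array([1.0, 0.0, 0.0, 0.0])
a = np.array([0.5])
b = np.array([1.0])
c = np.array([1.0])
y = ssm_scan(x, a, b, c)
```

Unlike self-attention, the recurrence never compares a token against all earlier tokens; the history is summarized in the fixed-size state `h`.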

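Proposed method

Adopts a deep state space model (Mamba) backbone in place of a Transformer backbone for multimodal sequence modeling.
Embeds proprioception, state differences, vision, and language into a shared latent space and concatenates them as input tokens.
Processes the tokens through stacked Mamba blocks to produce a final representation.
Predicts a short-horizon continuous action chunk directly from the final token with a linear action head.
Trains with a two-stage loss: first a velocity loss (L1 on actions), then velocity plus acceleration (L1 on temporal differences).
Maintains linear-time sequence processing to achieve real-time inference on resource-constrained hardware.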
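
The two-stage loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `(H, D)` chunk shape (H steps, D action dimensions), the mean reduction, and the `accel_weight` factor are all assumptions. The review states only that stage one uses an L1 loss on actions and stage two adds an L1 loss on temporal differences.

```python
import numpy as np

def velocity_loss(pred, target):
    """Mean L1 error on the action chunk itself. Shapes: (H, D)."""
    return float(np.mean(np.abs(pred - target)))

def acceleration_loss(pred, target):
    """Mean L1 error on first temporal differences along the chunk axis."""
    d_pred = np.diff(pred, axis=0)
    d_target = np.diff(target, axis=0)
    return float(np.mean(np.abs(d_pred - d_target)))

def two_stage_loss(pred, target, stage, accel_weight=1.0):
    """Stage 1: velocity only. Stage 2: velocity + weighted acceleration.

    accel_weight is a hypothetical knob, not a value taken from the paper.
    """
    loss = velocity_loss(pred, target)
    if stage == 2:
        loss += accel_weight * acceleration_loss(pred, target)
    return loss

# One action dimension, three steps; the prediction overshoots at step 1.
target = np.array([[0.0], [1.0], [2.0]])
pred = np.array([[0.0], [2.0], [2.0]])
stage1 = two_stage_loss(pred, target, stage=1)
stage2 = two_stage_loss(pred, target, stage=2)
```

In this example the overshoot costs 1/3 under the velocity term alone, while the difference term adds a further 1.0 because the overshoot distorts both neighboring step-to-step changes. That extra penalty on jerky predictions is how the second stage encourages smoothness.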

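Experimental results

Research questions

RQ1: Can a lightweight VLA based on a deep state space model match or exceed Transformer-based VLAs in accuracy and speed for language-guided manipulation?
RQ2: Does two-stage training with an acceleration loss improve trajectory smoothness and task success in both simulation and the real world?
RQ3: Is generating continuous actions directly, without discretizing them, feasible and beneficial for robotics?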
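
For context on RQ3, the sketch below shows what discretization costs in the simplest case. It assumes uniform binning over a fixed range, a scheme used by some action-tokenizing VLAs; the range, bin count, and function name are hypothetical and are not taken from the paper. Mapping an action to a bin and back to the bin center can shift it by up to half a bin width, whereas a continuous action head has no such rounding step.

```python
import numpy as np

def discretize(action, low=-1.0, high=1.0, bins=256):
    """Round continuous actions to the centers of uniform bins.

    low, high, and bins are illustrative assumptions.
    """
    width = (high - low) / bins
    idx = np.clip(np.floor((action - low) / width), 0, bins - 1)
    return low + (idx + 0.5) * width

actions = np.linspace(-1.0, 1.0, 1001)
max_error = float(np.max(np.abs(discretize(actions) - actions)))
half_bin = (1.0 - (-1.0)) / 256 / 2  # worst-case rounding error
```

The rounding error itself is small at 256 bins. Whether avoiding it translates into better task success is the empirical question RQ3 poses.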

Key findings

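Method	Total params	Meta-World Easy [%]	Meta-World Medium [%]	Meta-World Hard [%]	Meta-World Very Hard [%]	Meta-World Avg. [%]	Real-world Move [%]	Real-world Pick [%]	Real-world Open [%]	Real-world Close [%]	Real-world Push [%]	Real-world Avg. [%]	Inference time [ms/chunk]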
π0.5	3.0 B	68.20	37.30	41.70	28.00	43.80	15	20	30	100	45	42	578
VLA-Adapter	0.5 B	3.75	0.0	0.0	0.0	0.94	35	20	0	100	45	40	101
TinyVLA	0.42 B	77.60	21.50	11.40	15.80	31.58	5	15	0	90	35	29	1290
SmolVLA	0.45 B	82.50	41.80	45.00	60.00	57.33	30	25	55	100	50	52	309
AnoleVLA	0.47 B	89.29	45.45	66.67	70.00	67.85	45	40	75	100	55	63	216
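
As a sanity check, the averages and the speedup can be recomputed from the table rows. The snippet assumes each average is an unweighted mean over the listed difficulty levels or tasks, which matches the reported numbers; the variable names are illustrative.

```python
# Success rates [%] copied from the table above.
meta_world = {
    "π0.5": [68.20, 37.30, 41.70, 28.00],
    "VLA-Adapter": [3.75, 0.0, 0.0, 0.0],
    "TinyVLA": [77.60, 21.50, 11.40, 15.80],
    "SmolVLA": [82.50, 41.80, 45.00, 60.00],
    "AnoleVLA": [89.29, 45.45, 66.67, 70.00],
}
real_world = {
    "π0.5": [15, 20, 30, 100, 45],
    "VLA-Adapter": [35, 20, 0, 100, 45],
    "TinyVLA": [5, 15, 0, 90, 35],
    "SmolVLA": [30, 25, 55, 100, 50],
    "AnoleVLA": [45, 40, 75, 100, 55],
}
latency_ms = {"π0.5": 578, "AnoleVLA": 216}

# Unweighted means (an assumption that reproduces the table's Avg. columns).
mw_avg = {name: sum(v) / len(v) for name, v in meta_world.items()}
rw_avg = {name: sum(v) / len(v) for name, v in real_world.items()}

margin = rw_avg["AnoleVLA"] - rw_avg["π0.5"]           # real-world gap in points
speedup = latency_ms["π0.5"] / latency_ms["AnoleVLA"]  # latency ratio
```

The real-world margin over π0.5 comes out at 21 points, matching the abstract, and the latency ratio is about 2.7, which the abstract rounds to "approximately three times".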

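On Meta-World, AnoleVLA achieves the highest average success rate (67.85%), ahead of SmolVLA (57.33%) and π0.5 (43.80%).
In physical experiments, AnoleVLA achieves an average task success rate of 63% across five tasks, outperforming SmolVLA (52%) and the other baselines under a limited compute budget.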
On the real robot, AnoleVLA's inference is roughly three times faster than π0.5: 216 ms versus 578 ms per chunk, a ratio of about 2.7.
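Two-stage training with the acceleration loss improves the average success rate by 4.73 points over training with the velocity loss alone.
Ablations indicate that the acceleration loss contributes to smoother, more stable trajectories and better performance.
Qualitatively, AnoleVLA can generate complex, language-conditioned tool-use trajectories, although some localization errors are observed.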

This review was generated by AI and checked by a human editor.