QUICK REVIEW

[Paper Review] Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion

Zipeng Fu, Xuxin Cheng|arXiv (Cornell University)|Oct 18, 2022

Muscle activation and electromyography studies | Cited by 25

One-Line Summary

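The authors learn a single unified policy that controls a quadruped's legs and an attached arm together, so that the robot can manipulate and walk at the same time. A Regularized Online Adaptation module bridges the sim-to-real gap, and Advantage Mixing speeds up training.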

ABSTRACT

An attached arm can significantly increase the applicability of legged robots to several mobile manipulation tasks that are not possible for the wheeled or tracked counterparts. The standard hierarchical control pipeline for such legged manipulators is to decouple the controller into that of manipulation and locomotion. However, this is ineffective. It requires immense engineering to support coordination between the arm and legs, and error can propagate across modules causing non-smooth unnatural motions. It is also biologically implausible given evidence for strong motor synergies across limbs. In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing exploiting the causal dependency in the action space to overcome local minima when training the whole-body system. We also present a simple design for a low-cost legged manipulator, and find that our unified policy can demonstrate dynamic and agile behaviors across several task setups. Videos are at https://maniploco.github.io

Motivation and Goals

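Tightly coordinate arm and leg control so that legged robots can perform mobile manipulation.
Develop a single end-to-end policy that unifies manipulation and locomotion.
Achieve sim-to-real transfer without a two-stage teacher-student setup.
Demonstrate robust learning on a low-cost hardware platform across diverse task setups.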

Proposed Method

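Formulate a single neural policy π that takes the base, arm, and leg states, the previous actions, and the environment extrinsics as input, and outputs target joint positions for both the arm and the legs.
Train the policy with PPO on a reward that combines manipulation and locomotion terms.
Introduce Advantage Mixing, which mixes the manipulation and locomotion advantages during policy updates so that credit assignment is decomposed: each reward is credited mainly to the limbs that drive it (the arm for manipulation, the legs for locomotion).
Propose Regularized Online Adaptation to bridge the sim-to-real gap: an environment-extrinsics latent z_mu is learned from privileged simulation data and regularized toward the estimate z_phi inferred from onboard observations. Because both are trained together, no separate teacher-student stage is needed.
Use joint-space position control for the arm and legs, with PD controllers turning target joint positions into torques; this simplifies learning and reduces the sim-to-real gap.
Build a low-cost, untethered hardware platform (a Go1 quadruped with a WidowX arm) for real-world evaluation.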
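Advantage Mixing can be sketched as below. This is a minimal illustration under our own assumptions, not the authors' code: we assume the arm action's log-probability is weighted by A_manip + β·A_loco and the leg action's by β·A_manip + A_loco, with β raised from 0 to 1 over training. The `Sample` fields, `beta_schedule`, and its linear 1000-step ramp are hypothetical. The objective is written in plain policy-gradient form for readability; in a PPO implementation the mixed advantages would replace the single advantage inside the clipped surrogate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """One transition (hypothetical fields): log-probs of the two action heads."""
    logp_arm: float
    logp_leg: float
    adv_manip: float  # advantage estimated from the manipulation reward only
    adv_loco: float   # advantage estimated from the locomotion reward only


def beta_schedule(step: int, ramp_steps: int = 1000) -> float:
    """Hypothetical linear curriculum: beta rises from 0 to 1 over ramp_steps."""
    return min(1.0, max(0.0, step / ramp_steps))


def mixed_weights(s: Sample, beta: float) -> tuple[float, float]:
    """Advantage Mixing: each action head mainly sees the advantage of its own task."""
    w_arm = s.adv_manip + beta * s.adv_loco
    w_leg = beta * s.adv_manip + s.adv_loco
    return w_arm, w_leg


def mixed_objective(batch: list[Sample], beta: float) -> float:
    """Policy-gradient surrogate: mean of logp_arm * w_arm + logp_leg * w_leg."""
    total = 0.0
    for s in batch:
        w_arm, w_leg = mixed_weights(s, beta)
        total += s.logp_arm * w_arm + s.logp_leg * w_leg
    return total / len(batch)


batch = [Sample(logp_arm=-1.0, logp_leg=-2.0, adv_manip=0.5, adv_loco=1.0)]
print(mixed_objective(batch, beta=0.0))  # -2.5: arm sees only A_manip, legs only A_loco
print(mixed_objective(batch, beta=1.0))  # -4.5: both heads see A_manip + A_loco
```

At β = 0 each head is trained only on its own task's advantage, which is the credit-assignment shortcut; at β = 1 both heads see the sum A_manip + A_loco, so training ends on the full combined signal.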
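The Regularized Online Adaptation regularizer can be illustrated with the sketch below. It assumes, as we read the paper, a joint loss of the form -J_PPO + λ‖z_mu − sg[z_phi]‖² + ‖sg[z_mu] − z_phi‖², where sg is stop-gradient. The first term pulls the privileged latent z_mu toward what the adaptation module can actually estimate, and the second trains z_phi to imitate z_mu. Plain Python lists stand in for network outputs, the PPO term is omitted, and the function names and λ value are hypothetical. The gradients are written out by hand only to show which latent each term updates.

```python
from __future__ import annotations


def sq_dist(a: list[float], b: list[float]) -> float:
    """Squared L2 distance ||a - b||^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def roa_regularizer(z_mu: list[float], z_phi: list[float], lam: float) -> float:
    """Value of lam*||z_mu - sg[z_phi]||^2 + ||sg[z_mu] - z_phi||^2.

    Stop-gradient does not change the value, only which latent each term trains.
    """
    return lam * sq_dist(z_mu, z_phi) + sq_dist(z_mu, z_phi)


def roa_grads(z_mu: list[float], z_phi: list[float], lam: float):
    """Hand-written gradients that honor the stop-gradients.

    z_mu gets gradient only from the lam-weighted term (z_phi is frozen there);
    z_phi gets gradient only from the second term (z_mu is frozen there).
    """
    g_mu = [2.0 * lam * (m - p) for m, p in zip(z_mu, z_phi)]
    g_phi = [2.0 * (p - m) for m, p in zip(z_mu, z_phi)]
    return g_mu, g_phi


z_mu, z_phi = [1.0, 0.0], [0.0, 0.0]
print(roa_regularizer(z_mu, z_phi, lam=0.5))  # 1.5
print(roa_grads(z_mu, z_phi, lam=0.5))        # ([1.0, 0.0], [-2.0, 0.0])
```

Because both latents are optimized in a single training run, there is no separate teacher-then-student phase; at deployment only the onboard estimate z_phi is available.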
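Joint-space PD control, as used for both limbs, converts the policy's target joint positions into torques. The sketch below assumes the standard form τ = K_p(q_target − q) − K_d·q̇, a flat action vector with the leg targets first and the arm targets after, and made-up gains. `pd_torques`, `whole_body_torques`, the ordering, and every gain value are hypothetical and do not come from the paper.

```python
from __future__ import annotations


def pd_torques(q_target: list[float], q: list[float], dq: list[float],
               kp: float, kd: float) -> list[float]:
    """Per-joint PD: tau_i = kp * (q_target_i - q_i) - kd * dq_i."""
    if not (len(q_target) == len(q) == len(dq)):
        raise ValueError("q_target, q and dq must have the same length")
    return [kp * (qt - qi) - kd * dqi for qt, qi, dqi in zip(q_target, q, dq)]


def whole_body_torques(action: list[float], q: list[float], dq: list[float],
                       n_leg: int,
                       leg_gains: tuple[float, float] = (20.0, 0.5),
                       arm_gains: tuple[float, float] = (5.0, 0.1)) -> list[float]:
    """Split one policy output into leg and arm targets (assumed ordering) and
    track each part with its own hypothetical PD gains."""
    legs = pd_torques(action[:n_leg], q[:n_leg], dq[:n_leg], *leg_gains)
    arm = pd_torques(action[n_leg:], q[n_leg:], dq[n_leg:], *arm_gains)
    return legs + arm


print(whole_body_torques([0.5, 1.0], q=[0.0, 0.0], dq=[1.0, 0.0], n_leg=1))  # [9.5, 5.0]
```

With position targets as the action space, the PD loop handles the high-rate stabilization, which is one common reason such policies transfer more easily from simulation than policies that output raw torques.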

Experimental Results

Research Questions

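RQ1: Can a single unified policy coordinate legged locomotion and arm manipulation more effectively than decoupled or partially coupled controllers?
RQ2: Does Advantage Mixing speed up learning of simultaneous manipulation and locomotion and improve command tracking?
RQ3: Can Regularized Online Adaptation deliver robust sim-to-real transfer without a two-stage teacher-student pipeline?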

Key Findings

Method	Survival rate (%)	Base acceleration	Velocity error	EE error	Total energy
Unified (ours)	97.1±0.61	1.00±0.03	0.31±0.03	0.63±0.02	50±0.90
Separate	92.0±0.90	1.40±0.04	0.43±0.07	0.92±0.10	51±0.30
Uncoordinated	94.9±0.61	1.03±0.01	0.33±0.01	0.73±0.02	50±0.28
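The snippet below makes the size of the gaps in the table concrete by computing the unified policy's relative reduction against the Separate baseline. It assumes that lower is better for base acceleration, velocity error, and EE error. It uses only the reported means (the ± spreads are ignored), and the dictionary and function names are our own.

```python
from __future__ import annotations

# Mean values copied from the table (Unified vs. Separate).
UNIFIED = {"survival": 97.1, "base_acc": 1.00, "vel_err": 0.31, "ee_err": 0.63}
SEPARATE = {"survival": 92.0, "base_acc": 1.40, "vel_err": 0.43, "ee_err": 0.92}


def relative_reduction(ours: float, baseline: float) -> float:
    """Percent reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - ours) / baseline * 100.0


for key in ("base_acc", "vel_err", "ee_err"):
    print(key, round(relative_reduction(UNIFIED[key], SEPARATE[key]), 1))
# prints: base_acc 28.6, vel_err 27.9, ee_err 31.5 (one per line)
print("survival gain (points)", round(UNIFIED["survival"] - SEPARATE["survival"], 1))
# prints: survival gain (points) 5.1
```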

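The unified policy beats the Separate and Uncoordinated baselines on survival rate, base acceleration, velocity error, and EE error, while using equal or lower total energy.
Advantage Mixing speeds up learning of both manipulation and locomotion, improving command tracking and shortening the time to convergence.
Regularized Online Adaptation gives better sim-to-real transfer than Rapid Motor Adaptation and Domain Randomization, with lower imitation error and better EE tracking.
The unified policy enlarges the arm's workspace and improves stability under disturbances, showing robust whole-body coordination between the legs and the arm.
In real-robot experiments, the policy produces agile, coordinated leg-arm motions and achieves higher task success rates and faster execution than a baseline MPC+IK controller.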

This review was generated by AI and checked by a human editor.