QUICK REVIEW

[Paper Review] Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model

Paul F. Christiano, Zain Shah|arXiv (Cornell University)|Oct 11, 2016

Reinforcement Learning in Robotics | References: 3 | Citations: 166

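One-Line Summary

The paper presents a method for transferring a policy trained in simulation to the real world by learning a deep inverse dynamics model in the target domain. At each step, the simulator predicts the next observation the policy expects, and the inverse model picks the target-domain action that should reach it.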

ABSTRACT

Developing control policies in simulation is often more practical and safer than directly running experiments in the real world. This applies to policies obtained from planning and optimization, and even more so to policies obtained from reinforcement learning, which is often very data demanding. However, a policy that succeeds in simulation often doesn't work when deployed on a real robot. Nevertheless, often the overall gist of what the policy does in simulation remains valid in the real world. In this paper we investigate such settings, where the sequence of states traversed in simulation remains reasonable for the real world, even if the details of the controls are not, as could be the case when the key differences lie in detailed friction, contact, mass and geometry properties. During execution, at each time step our approach computes what the simulation-based control policy would do, but then, rather than executing these controls on the real robot, our approach computes what the simulation expects the resulting next state(s) will be, and then relies on a learned deep inverse dynamics model to decide which real-world action is most suitable to achieve those next states. Deep models are only as good as their training data, and we also propose an approach for data collection to (incrementally) learn the deep inverse dynamics model. Our experiments show that our approach compares favorably with various baselines that have been developed for dealing with simulation to real world model discrepancy, including output error control and Gaussian dynamics adaptation.

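Motivation and Objectives

Leverage a competent source-domain policy to perform well in the target domain (often the real world) despite the gap between simulation and reality.
Exploit the idea that high-level policy behavior transfers, while low-level control details differ because of friction, contact, and other dynamics.
Develop an online data collection strategy for training a deep inverse dynamics model that adapts actions in the target domain.
Demonstrate transfer effectiveness through Sim1→Sim2 and Sim→Real experiments, including contact-rich tasks.
Compare against baselines that address model mismatch through output error control and Gaussian dynamics adaptation.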

Proposed Method

At each time step, compute a source-domain action a_source = pi_source(tau_-k:).
Predict the next source-domain observation o_next_hat = o(T_source(tau_-k:, a_source)).
Use a learned inverse dynamics model phi(tau_-k:, o_next_hat) to select the target-domain action a_target.
Train phi on target-domain data to map the recent history and the observed next observation, (tau_-k:, o_next), to the action that produced that transition.
Include the history window tau_-k: (recent observations and actions) in the input to phi to capture temporal dependencies and latent factors in the dynamics.
Collect training data by executing a preliminary target-domain policy with selective exploration noise, and iteratively refine phi on the data gathered.
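
The steps above can be sketched on a toy problem. This is an illustrative sketch under strong assumptions and not the authors' implementation: the "deep" inverse dynamics model is replaced by linear least squares, the history window is reduced to the current observation, data is collected in a single round instead of iteratively, and the 1-D point-mass dynamics, gains, and function names (step_source, step_target, pi_source, phi) are all hypothetical.

```python
import numpy as np

DT = 0.05  # integration step (assumed)

def step_source(state, action):
    """T_source: simulated unit-mass point with an ideal actuator."""
    pos, vel = state
    vel = vel + action * DT
    pos = pos + vel * DT
    return np.array([pos, vel])

def step_target(state, action):
    """Stand-in for the real system: weaker actuator plus a constant bias force."""
    pos, vel = state
    vel = vel + (0.6 * action - 0.5) * DT
    pos = pos + vel * DT
    return np.array([pos, vel])

def pi_source(state, goal=1.0):
    """Source-domain policy: a PD controller tuned against step_source."""
    pos, vel = state
    return 8.0 * (goal - pos) - 4.0 * vel

def features(obs, next_obs):
    """Inverse-model input: current observation, next observation, bias term."""
    return np.concatenate([obs, next_obs, [1.0]])

def collect(policy, n_steps, noise_std, rng):
    """Run a noisy policy in the target domain and log (o, o_next) -> a pairs."""
    X, y = [], []
    state = np.zeros(2)
    for _ in range(n_steps):
        action = policy(state) + rng.normal(0.0, noise_std)
        next_state = step_target(state, action)
        X.append(features(state, next_state))
        y.append(action)
        state = next_state
    return np.array(X), np.array(y)

def fit_inverse_dynamics(X, y):
    """Least-squares stand-in for phi. lstsq returns the minimum-norm
    solution, which copes with the redundant position features."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda obs, next_obs: float(features(obs, next_obs) @ w)

def rollout(select_action, n_steps=200):
    """Execute a controller in the target domain and return the final state."""
    state = np.zeros(2)
    for _ in range(n_steps):
        state = step_target(state, select_action(state))
    return state

rng = np.random.default_rng(0)
X, y = collect(pi_source, n_steps=300, noise_std=1.0, rng=rng)
phi = fit_inverse_dynamics(X, y)

def adapted_policy(state):
    a_source = pi_source(state)                # what the simulated policy would do
    o_next_hat = step_source(state, a_source)  # what the simulator expects next
    return phi(state, o_next_hat)              # target action meant to reach it

final_direct = rollout(pi_source)        # source policy applied unchanged
final_adapted = rollout(adapted_policy)  # action adapted through phi
print("direct transfer final position: ", final_direct[0])
print("adapted transfer final position:", final_adapted[0])
```

In this toy setting the unchanged source policy settles short of the goal because of the actuator mismatch, while the adapted controller tracks the simulated trajectory. The toy inverse dynamics happen to be exactly linear, which is why least squares suffices here; contact-rich dynamics are the case where the paper argues a deep model and a history window are needed.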

Experimental Results

Research Questions

RQ1: Can a deep inverse dynamics model learned in the target domain enable effective transfer from a source-domain policy to the target domain?
RQ2: Does using predicted next observations and an inverse model outperform direct policy transfer or forward-dynamics adaptation in simulation-to-real transfer, especially with contact-rich dynamics?
RQ3: How does history-aware inverse dynamics learning impact data efficiency and adaptation performance?
RQ4: What is the comparative performance of the proposed method against output error control and Gaussian dynamics adaptation baselines in varied dynamics?
RQ5: Is action adaptation sufficient to achieve robust Sim-to-Real transfer without state/observation adaptation?

Key Findings

The proposed method achieves compelling transfer from simulation to real world, including challenging contact-rich dynamics.
The proposed action adaptation outperforms the baseline methods (output error control and Gaussian dynamics adaptation) in both Sim1→Sim2 and Sim→Real settings.
Using history in the inverse dynamics model reduces data requirements and improves convergence.
Learning with targeted, task-relevant data collection yields faster convergence than random exploration.
In Sim→Real Fetch experiments, the method significantly reduces deviation from the simulated trajectory compared with a PD baseline.
The approach remains effective across variations in gravity and motor noise, and handles discontinuities arising from contacts.

This review was generated by AI and checked by a human editor.