QUICK REVIEW

[Paper Review] On the Utility of Learning about Humans for Human-AI Coordination

Micah Carroll, Rohin Shah|arXiv (Cornell University)|Oct 13, 2019

Reinforcement Learning in Robotics | Cited by 91

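Summary in brief

Agents trained with self-play coordinate well with copies of themselves but poorly with humans. Training with human data or a human model improves human-AI coordination. The paper tests this in an environment based on Overcooked and confirms it in a user study with real humans.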

ABSTRACT

While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves. Agents that assume their partner to be optimal or similar to them can converge to coordination protocols that fail to understand and be understood by humans. To demonstrate this, we introduce a simple environment that requires challenging coordination, based on the popular game Overcooked, and learn a simple model that mimics human play. We evaluate the performance of agents trained via self-play and population-based training. These agents perform very well when paired with themselves, but when paired with our human model, they are significantly worse than agents designed to play with the human model. An experiment with a planning algorithm yields the same conclusion, though only when the human-aware planner is given the exact human model that it is playing with. A user study with real humans shows this pattern as well, though less strongly. Qualitatively, we find that the gains come from having the agent adapt to the human's gameplay. Given this result, we suggest several approaches for designing agents that learn about humans in order to better coordinate with them. Code is available at https://github.com/HumanCompatibleAI/overcooked_ai.

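Motivation and objectives

Motivate the need for human-aware coordination in AI systems by addressing the failure of self-play when agents collaborate with humans.
Introduce an Overcooked-style environment that tests human-AI collaboration under challenging coordination requirements.
Evaluate self-play, population-based training, planning, and human-model-based training for the purpose of collaborating with humans.
Show that incorporating a human model improves performance with both simulated partners and real human partners.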

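Proposed method

Develop an Overcooked-style multi-agent environment built around onions, dishes, and soups, which gives rise to coordination challenges.
Collect human-human trajectories for each layout and train a simple behavior-cloning (BC) human model.
Compare agents trained with self-play (SP), population-based training (PBT), and coupled planning (CP) against agents that use the human model (PPO BC, and planning with BC).
Evaluate the agents against a held-out proxy human model, H_Proxy, and in a user study with real humans.
As a gold-standard baseline, train an agent with direct access to the proxy human model, which upper-bounds the achievable performance.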
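
The behavior-cloning step above can be illustrated with a minimal sketch. It assumes a tabular setting in which states and actions are plain strings, and it estimates the human policy by counting how often each action follows each state. The state names, action names, and demonstration data are hypothetical, and the counting estimator is a simplification chosen for illustration; the review does not specify the model class used in the paper.

```python
from collections import Counter, defaultdict


def fit_bc_policy(trajectories):
    """Tabular behavior cloning: estimate P(action | state) from counts.

    `trajectories` is a list of trajectories, each a list of
    (state, action) pairs recorded from human play.
    """
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for state, action in trajectory:
            counts[state][action] += 1

    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


def most_likely_action(policy, state, default="stay"):
    """Return the most frequent demonstrated action, or `default` if unseen."""
    if state not in policy:
        return default
    dist = policy[state]
    return max(dist, key=dist.get)


# Hypothetical demonstrations (not data from the paper).
demos = [
    [("near_onion", "pickup"), ("holding_onion", "move_to_pot"), ("at_pot", "drop")],
    [("near_onion", "pickup"), ("holding_onion", "move_to_pot"), ("at_pot", "drop")],
    [("near_onion", "wait"), ("holding_onion", "move_to_pot")],
]
bc = fit_bc_policy(demos)
```

Fitting one such model on part of the human data and a second one on the remaining data would give a training-time human model and a held-out proxy in the spirit of H_Proxy, although the exact split used in the paper is not stated here.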

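Experimental results

Research questions

RQ1: Does coordination learned through self-play degrade when the agent is paired with a suboptimal human model or with real humans?
RQ2: Does incorporating a human model into training, via behavior cloning or planning, improve human-AI coordination compared with self-play alone?
RQ3: How do planning-based and reinforcement-learning-based approaches compare when collaborating with human partners?
RQ4: Do findings obtained with a simulated proxy human generalize to real human users?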

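Key findings

Self-play and PBT agents perform well when paired with themselves but markedly worse when paired with the proxy human model or with real humans.
Agents trained with the behavior-cloned human model (PPO BC) outperform self-play agents when paired with humans and, where possible, approach gold-standard performance.
Planning with access to the true human model is effective, but planning with the BC model can lead to loops and degraded performance when the human model is inaccurate.
Imitation-based human models improve coordination more than assuming that humans are optimal or identical to the agent. Planning or RL with a human model usually outperforms plain imitation.
In the user study, PPO BC generally outperforms SP and PBT across several layouts, though the effect varies with the task layout and the quality of the model.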
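
The intuition behind these findings can be sketched with a toy matching game, which is not the Overcooked environment and uses no numbers from the paper. Two players each pick a convention and score 1 only if they match. A self-play agent is assumed to have settled on an arbitrary convention, while a hypothetical human model prefers the other one. All names and probabilities below are illustrative assumptions.

```python
def expected_reward(agent_action, partner_dist):
    """Expected reward in a pure matching game.

    The reward is 1 when both players choose the same convention, so the
    expectation equals the probability the partner picks `agent_action`.
    """
    return partner_dist.get(agent_action, 0.0)


def best_response(partner_dist):
    """Pick the convention the partner is most likely to use."""
    return max(partner_dist, key=partner_dist.get)


# Hypothetical human model: mostly goes counterclockwise around the kitchen.
human_model = {"clockwise": 0.2, "counterclockwise": 0.8}

# Assumption: self-play happened to converge on the other convention.
self_play_action = "clockwise"
self_play_partner = {self_play_action: 1.0}

sp_with_self = expected_reward(self_play_action, self_play_partner)
sp_with_human = expected_reward(self_play_action, human_model)

human_aware_action = best_response(human_model)
aware_with_human = expected_reward(human_aware_action, human_model)
```

In this toy setting the self-play agent is perfect with itself but matches the human model only 20% of the time, while the agent that best-responds to the human model matches it 80% of the time. The gap is there by construction and only illustrates the mechanism; it says nothing about the effect sizes reported in the paper.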

This review was generated by AI and checked by a human editor.