QUICK REVIEW

[Paper Review] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

Shengran Hu, Jeff Clune | arXiv (Cornell University) | 2023. 06. 01.

Explainable Artificial Intelligence (XAI) | Citations: 9

One-Line Summary

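Thought Cloning trains agents to think in language by imitating the thoughts humans have while acting, yielding faster learning, better generalization, and improved safety and interpretability compared with Behavioral Cloning.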

ABSTRACT

Language is often considered a key aspect of human thinking, providing us with exceptional abilities to generalize, explore, plan, replan, and adapt to new situations. However, Reinforcement Learning (RL) agents are far from human-level performance in any of these abilities. We hypothesize one reason for such cognitive deficiencies is that they lack the benefits of thinking in language and that we can improve AI agents by training them to think like humans do. We introduce a novel Imitation Learning framework, Thought Cloning, where the idea is to not just clone the behaviors of human demonstrators, but also the thoughts humans have as they perform these behaviors. While we expect Thought Cloning to truly shine at scale on internet-sized datasets of humans thinking out loud while acting (e.g. online videos with transcripts), here we conduct experiments in a domain where the thinking and action data are synthetically generated. Results reveal that Thought Cloning learns much faster than Behavioral Cloning and its performance advantage grows the further out of distribution test tasks are, highlighting its ability to better handle novel situations. Thought Cloning also provides important benefits for AI Safety and Interpretability, and makes it easier to debug and improve AI. Because we can observe the agent's thoughts, we can (1) more easily diagnose why things are going wrong, making it easier to fix the problem, (2) steer the agent by correcting its thinking, or (3) prevent it from doing unsafe things it plans to do. Overall, by training agents how to think as well as behave, Thought Cloning creates safer, more powerful agents.

Research Motivation and Goals

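Motivate the use of language-like thinking to improve sample efficiency, generalization, planning, and replanning in RL agents.
Propose Thought Cloning, an imitation learning framework that jointly learns thoughts (in natural language) and actions from synchronized thought-action demonstrations.
Demonstrate on a synthetic thought dataset built in BabyAI that Thought Cloning outperforms Behavioral Cloning, with stronger advantages in out-of-distribution generalization and safety.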

Proposed Method

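Introduce a bi-level architecture consisting of a Thought Generator and an Action Generator.
Train with a combined loss: a thought prediction loss (Thought Cloning loss) conditioned on the mission, observations, and thought history, plus an action prediction loss (Action loss).
Use a synthetic thought dataset derived from BabyAI, in which thought records are aligned with actions and missions and noise is added to simulate real-world data.
Implement the Thought Generator as a memory-augmented LSTM with FiLM fusion; optionally consider pretrained Vision-Language models for scaling.
Compare Thought Cloning against Behavioral Cloning and against a TC variant without the thought imitation loss, to isolate the benefit of supervising thinking.
Evaluate on 1M-trajectory BabyAI data, using 8 training epochs and a teacher-forcing schedule that gradually shifts toward autoregressive thought generation.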
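
The combined loss and the teacher-forcing schedule listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weight `alpha`, and the linear decay shape are hypothetical and were chosen here only to make the idea concrete. The review itself states only that a thought loss and an action loss are combined and that training gradually moves toward autoregressive thought generation over 8 epochs.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of index `target` under the probability vector `probs`."""
    return -math.log(probs[target])


def thought_cloning_loss(thought_probs, thought_tokens, action_probs, action, alpha=1.0):
    """Per-timestep loss: action loss plus an alpha-weighted thought loss.

    thought_probs  : list of probability vectors, one per thought token
                     (hypothetical Thought Generator output)
    thought_tokens : demonstrated thought token ids
    action_probs   : probability vector over actions
                     (hypothetical Action Generator output)
    action         : demonstrated action id
    alpha          : assumed weighting between the two terms
    """
    thought_loss = sum(
        cross_entropy(p, t) for p, t in zip(thought_probs, thought_tokens)
    )
    action_loss = cross_entropy(action_probs, action)
    return alpha * thought_loss + action_loss


def teacher_forcing_prob(epoch, total_epochs=8, start=1.0, end=0.0):
    """Assumed linear schedule: probability of feeding the demonstrated thought
    (rather than the generated one) to the Action Generator at a given epoch."""
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * frac


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.25, 0.75]]
    print(thought_cloning_loss(probs, [0, 1], [0.1, 0.9], 1))
    print([round(teacher_forcing_prob(e), 2) for e in range(8)])
```

With this schedule the demonstrated thought is always used in the first epoch and the agent's own generated thought is always used in the last one. Other decay shapes would serve the same purpose.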

Figure 1: Overall framework for Thought Cloning (TC). The TC agent has two components: the Thought Generator and Action Generator. At each timestep, the TC agent receives an observation, a mission, and a history of thoughts as inputs. The Thought Generator generates thoughts, and the Action Genera
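
The per-timestep flow in the Figure 1 caption can be sketched as a small function. This is an illustrative sketch only: `tc_step` and the two toy generators are hypothetical stand-ins, and it assumes that the Action Generator conditions on the newly generated thought, which the truncated caption does not confirm.

```python
from typing import Callable, List, Tuple

ThoughtGen = Callable[[str, str, List[str]], str]
ActionGen = Callable[[str, str, str], str]


def tc_step(
    mission: str,
    observation: str,
    thought_history: List[str],
    thought_generator: ThoughtGen,
    action_generator: ActionGen,
) -> Tuple[str, str, List[str]]:
    """One timestep: generate a thought, then an action, then extend the history."""
    thought = thought_generator(mission, observation, thought_history)
    # Assumption: the action is conditioned on the thought just produced.
    action = action_generator(mission, observation, thought)
    return thought, action, thought_history + [thought]


def toy_thought_generator(mission: str, observation: str, history: List[str]) -> str:
    """Toy stand-in for a learned Thought Generator."""
    return f"go to {observation}"


def toy_action_generator(mission: str, observation: str, thought: str) -> str:
    """Toy stand-in for a learned Action Generator."""
    return "forward" if thought.startswith("go to") else "noop"


if __name__ == "__main__":
    print(tc_step("pick up key", "door", [], toy_thought_generator, toy_action_generator))
```

Returning a new history list instead of mutating the input keeps each step easy to inspect, which fits the interpretability motivation of the method.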

Experimental Results

Research Questions

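RQ1: Can a dual-model imitation framework that learns both thoughts and actions from human thought demonstrations outperform traditional Behavioral Cloning?
RQ2: Do agents trained with Thought Cloning generalize better to out-of-distribution environments, and can they adapt through fine-tuning?
RQ3: Do the interpretability and safety benefits (e.g., the ability to intervene on thoughts) hold in practice?
RQ4: Does introducing human-like thinking enable faster learning and planning/replanning in challenging, partially observable domains?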

Key Results

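Thought Cloning learns faster than Behavioral Cloning and maintains superior performance throughout training.
Thought Cloning outperforms the TC variant without the thought imitation loss, showing that the advantage is not merely due to having more parameters.
Thought Cloning generalizes better to out-of-distribution environments in both zero-shot and fine-tuning scenarios.
The method provides an interpretability metric (Future Action Declaration Score) and enables Precrime Intervention to block unsafe plans.
With oracle high-level thoughts, Thought Cloning reaches near-optimal performance in most environments.
The results suggest that scaling Thought Cloning with large-scale human thought data could substantially improve capability and safety.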
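
The idea behind Precrime Intervention, halting the agent when its declared thought reveals an unsafe plan, can be sketched as below. This is a hypothetical illustration: the word-matching rule, the function names, and the `halt` action are assumptions made here, since the review says only that observing thoughts makes it possible to block unsafe plans.

```python
from typing import Iterable


def is_unsafe_thought(thought: str, unsafe_words: Iterable[str]) -> bool:
    """Return True if any unsafe word appears as a whole word in the thought.

    Assumed rule for illustration: simple case-insensitive word matching.
    """
    words = set(thought.lower().split())
    return any(w.lower() in words for w in unsafe_words)


def safe_act(
    thought: str,
    proposed_action: str,
    unsafe_words: Iterable[str],
    halt_action: str = "halt",
) -> str:
    """Replace the proposed action with `halt_action` when the thought is flagged."""
    if is_unsafe_thought(thought, unsafe_words):
        return halt_action
    return proposed_action


if __name__ == "__main__":
    print(safe_act("pick up red ball", "pickup", ["red"]))
    print(safe_act("open the blue door", "toggle", ["red"]))
```

Whole-word matching avoids flagging unrelated words that merely contain an unsafe substring, but it would miss paraphrases. A real deployment would likely need a more robust check on the thought text.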

Figure 2: Left: A BabyAI [26] environment example. The environment contains various colored items (ball, key, box, door). The agent can pick up, drop, and move objects or open and close doors, while locked doors can only be unlocked with color-matched keys. The agent can observe the $7\times 7


This review was generated by AI and reviewed by a human editor.