[Paper Review] Multi-Game Decision Transformers
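A single transformer-based model trained purely offline can play up to 46 Atari games at close-to-human performance, improves as model size grows, and adapts rapidly to new games via fine-tuning. Expert-action inference combined with offline training outperforms several baselines.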
A longstanding goal of the field of AI is a method for learning a highly capable, generalist agent from diverse experience. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model - with a single set of weights - trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction.
Motivation and Objectives
- Show that one model with one set of weights can act across diverse Atari environments using only offline data.
- Investigate scaling trends in performance as model size increases in a multi-environment setting.
- Evaluate rapid transfer/fine-tuning to novel games and compare against online/offline baselines.
- Propose and evaluate expert-action inference to generate high-quality actions during inference.
- Release pre-trained models and code to foster research on generalist RL agents.
Proposed Method
- Formulate reinforcement learning as offline sequence modeling using a decoder-style transformer to predict the next token in a sequence consisting of observations, returns, actions, and rewards.
- Tokenize actions, rewards, and returns into discrete tokens; use image patching to represent observations and add trainable position encodings.
- Train a single Multi-Game Decision Transformer on offline Atari trajectories (41 games, 4.1B steps, ~160B tokens) containing expert and non-expert behavior.
- Implement expert action inference at inference time via a binary expert classifier and a Bayes-like sampling of high-return targets to guide action selection.
- Compare multiple baselines (BC, C51 DQN, CQL offline TD, CPC, BERT, ACL) and ablations to assess multi-game performance and transfer.
- Evaluate scaling effects across model sizes (e.g., DT-10M, DT-40M, DT-200M), and assess fine-tuning on novel games.
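
The expert-action inference step listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the model emits logits over discrete return bins, and that the likelihood of a trajectory being expert grows exponentially with its normalized return. The function names, the `kappa` parameter, and the example bins are all hypothetical.

```python
import math
import random


def expert_return_distribution(return_logits, return_values, kappa=10.0):
    """Reweight a predicted return distribution toward high returns.

    return_logits: unnormalized log P(R | history) over discrete return bins
                   (in a real model these would come from the transformer).
    return_values: the return value each bin represents.
    kappa:         hypothetical inverse temperature; larger favors high returns more.
    """
    r_low, r_high = min(return_values), max(return_values)
    span = (r_high - r_low) or 1.0  # avoid division by zero for a single bin
    # Bayes rule in log space:
    #   log P(R | expert) = log P(R) + log P(expert | R) + const,
    # with the ASSUMED likelihood log P(expert | R) = kappa * normalized return.
    scores = [
        logit + kappa * (value - r_low) / span
        for logit, value in zip(return_logits, return_values)
    ]
    # Numerically stable softmax over the reweighted scores.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_target_return(return_logits, return_values, kappa=10.0, rng=random):
    """Sample the return token that the action prediction is conditioned on."""
    probs = expert_return_distribution(return_logits, return_values, kappa)
    return rng.choices(return_values, weights=probs, k=1)[0]


if __name__ == "__main__":
    bins = [0, 1, 2, 3, 4]            # hypothetical return bins
    logits = [0.0] * len(bins)        # a uniform predicted return distribution
    print(expert_return_distribution(logits, bins, kappa=2.0))
    print(sample_target_return(logits, bins, kappa=2.0))
```

With `kappa=0` the reweighting disappears and returns are sampled from the model's own prediction, which corresponds to imitating the average behavior in the data. Increasing `kappa` shifts probability mass toward the highest returns the model still considers plausible.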
Experimental Results
Research Questions
- RQ1: Can a single transformer with shared weights learn to act across multiple, diverse Atari games using offline data?
- RQ2: Do the scaling laws observed in language and vision hold for multi-game reinforcement learning with transformers?
- RQ3: How does the offline Decision Transformer compare to online RL and other offline baselines in a multi-environment setting?
- RQ4: Is rapid transfer to new games possible via fine-tuning, and how does pretraining affect transfer performance?
- RQ5: Does guiding action generation with expert-level inference improve performance over standard behavioral cloning?
Key Findings
- A single offline-trained transformer achieves 126% of human-level performance averaged across all 41 training games.
- Performance scales with model size across training games, with larger models training faster and achieving higher in-game scores.
- Multi-Game DT generally outperforms non-transformer offline methods and online multi-game baselines, though single-game specialists remain strongest.
- Pretraining DT on 41 games and fine-tuning on held-out games yields the best transfer, outperforming CQL and representation-learning baselines such as CPC, BERT, and ACL.
- Expert-action inference (optimality-conditioned sampling) significantly improves DT over standard behavioral cloning on most games.
- Training on a mix of expert and non-expert data outperforms expert-only training for the DT, and DT with full data beats BC trained on expert data.
- DT-based methods show improved top-rollout performance over the best demonstration in several games, indicating learning beyond the provided demonstrations.
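
Figures such as "126% of human-level performance" are normally read as human-normalized scores, where 0% corresponds to a random policy and 100% to a human reference score. The sketch below shows that standard normalization plus two common ways to aggregate across games. This review does not state which aggregate produced the 126% figure, so both a plain mean and an interquartile mean are shown; the game names and numbers in the example are made up for illustration.

```python
def human_normalized_score(agent, random_score, human):
    """Agent score as a percentage of the gap between random play and a human."""
    return 100.0 * (agent - random_score) / (human - random_score)


def interquartile_mean(values):
    """Mean of the middle half of the values.

    Simple variant: drops floor(n / 4) values from each end after sorting.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo, hi = n // 4, n - n // 4
    return sum(ordered[lo:hi]) / (hi - lo)


if __name__ == "__main__":
    # Hypothetical (agent, random, human) raw scores; NOT results from the paper.
    raw_scores = {
        "game_a": (150.0, 10.0, 110.0),
        "game_b": (40.0, 0.0, 80.0),
        "game_c": (900.0, 100.0, 500.0),
        "game_d": (5.0, 1.0, 21.0),
    }
    normalized = [human_normalized_score(*s) for s in raw_scores.values()]
    print("per-game:", normalized)
    print("mean:", sum(normalized) / len(normalized))
    print("interquartile mean:", interquartile_mean(normalized))
```

The interquartile mean is less sensitive than the plain mean to a few games with extremely high or low normalized scores, which is why it is often preferred when reporting across a large game suite.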
This review was generated by AI and checked by a human editor.