QUICK REVIEW

[論文レビュー] Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

Berry Gerrits|arXiv (Cornell University)|Jan 27, 2026

AI in Service Interactions被引用数 0

ひとこと要約

この論文は現代の大規模言語モデル（LLM）チャットボット（ChatGPT、Claude、Gemini）をゲーム特化の訓練なしにZork Iで評価し、平均で完走率が10％未満であり、堅牢なメタ認知や学習を示せないことを示す。

ABSTRACT

In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game's dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling ''extended thinking''. Qualitative analysis of the models' reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one's own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs' metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.

研究の動機と目的

現代のLLMが、タスク固有のファインチューニングなしで複雑なテキストベースのパズルを解けるかを評価する。
プロンプトの詳細さといわゆる「思考」機能が性能にどのような影響を与えるかを検証する。
戦略の継続性や過去の試行からの学習といったメタ認知的側面を調査する。

提案手法

3つの提供元からの6つのLLMベースのチャットボットを、2つのプロンプト条件（基本と高度）で使用する。
Pythonスクリプトを介してZork IにLLMsを接続し、ゲーム出力をフィードしてコマンドを取得、モデルが完全な会話履歴にアクセスできるようにする。
モデル-プロンプトの各組み合わせにつき5回の独立実行を評価し、合計40回の実行、1回あたりの最大移動数は500。
分析のために移動、得点（最大350点）、会話ログを記録する。

実験結果

リサーチクエスチョン

RQ1最先端のLLMは、ゲーム固有の訓練なしに長文のテキストのみのアドベンチャーゲームを意味のある完成まで到達できるか。
RQ2詳細なゲーム知識を提供したり「思考」モードを有効にすることは、性能を改善するか、メタ認知的な計画を反映するか。
RQ3LLMsはZork Iにおいて適応的推論、記憶の使用、試行を重ねる学習を示すか。
RQ4プレイ中に本当に理解しているかどうかを示す定性的パターンは何か。
RQ5性能差は、単なる事実記憶よりもメタ認知的制約に起因するのか。

主な発見

全モデルとも平均してゲームの完走率が10％未満（約10％程度）、Claude Opus 4.5が最も高い約75/350点（約20％）を達成。
詳細なゲーム指示の提供や拡張思考を有効にしても、いずれのモデルにおいても性能は向上しなかった。
基本プロンプトと高度プロンプト、また思考有効化と非思考設定の間に providerを跨いだ有意な差はほとんどなかった。
定性的分析では、反省のない反復的な行動と会話履歴からの学習の欠如が示され、メタ認知と計画の限界が示唆される。
ChatGPTは長い同種のコマンドの連続で行き詰まる傾向があり、自己修正や戦略適応が最小限であることを示した。
Claudeのみが、取り返しのつかないループに陥った際にExplicitに「I give up」と発することがあった。
総じて、LLMのメタ認知とテキストベースのドメイン特有の問題解決能力には substantialな制約があると示唆される。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。