QUICK REVIEW

[Paper Review] Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

Berry Gerrits|arXiv (Cornell University)|2026. 01. 27.

AI in Service Interactions|Citations: 0

One-Line Summary

This paper evaluates current LLM chatbots (ChatGPT, Claude, Gemini) on Zork I without game-specific training and finds that they average less than 10% completion and show neither robust metacognition nor an ability to learn.

ABSTRACT

In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game's dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling "extended thinking". Qualitative analysis of the models' reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one's own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs' metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.

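Research Motivation and Objectives

Assess whether modern LLMs can solve complex text-based puzzles without game-specific fine-tuning.
Examine how prompt detail and the so-called "thinking" feature affect performance.
Examine metacognitive aspects such as strategy persistence and learning from previous attempts.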

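Proposed Method

Six LLM-based chatbots from three providers (Claude Opus/Sonnet, ChatGPT, Gemini) are tested under two prompt conditions (basic and advanced).
A Python script connects each LLM to Zork I: it feeds the game output to the model, captures the model's commands, and gives the model access to the full conversation history.
Five independent runs are evaluated per model-prompt pair, for a total of 40 runs, each capped at 500 moves.
The number of moves, the score (maximum 350 points), and the conversation logs are recorded for analysis.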
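
The interaction loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: `ToyGame` is a hypothetical stand-in for a Zork I interpreter, `scripted_model` is a hypothetical placeholder for a call to a chatbot API, and the 5-point reward is made up. Only the 500-move cap, the 350-point maximum, and the idea of passing the full conversation history come from the review.

```python
from dataclasses import dataclass, field

MAX_MOVES = 500  # per-run move limit stated in the review
MAX_SCORE = 350  # maximum score stated in the review


@dataclass
class RunLog:
    moves: int = 0
    score: int = 0
    transcript: list = field(default_factory=list)  # (role, text) pairs


class ToyGame:
    """Hypothetical stand-in for a Zork I interpreter; not the real game."""

    def __init__(self):
        self.score = 0
        self.mailbox_open = False

    def start(self):
        return "You are west of a white house. There is a small mailbox here."

    def step(self, command):
        """Return (game_text, score, finished) for one player command."""
        if command == "open mailbox" and not self.mailbox_open:
            self.mailbox_open = True
            self.score += 5  # made-up reward, for illustration only
            return "Opening the mailbox reveals a leaflet.", self.score, False
        if command == "quit":
            return "Goodbye.", self.score, True
        return "Nothing happens.", self.score, False


def scripted_model(history):
    """Hypothetical model; a real setup would send `history` to a chatbot."""
    commands_so_far = sum(1 for role, _ in history if role == "assistant")
    script = ["open mailbox", "open mailbox", "quit"]
    return script[min(commands_so_far, len(script) - 1)]


def run_episode(game, model, max_moves=MAX_MOVES):
    """Play one run, recording moves, score, and the conversation log."""
    log = RunLog()
    log.transcript.append(("user", game.start()))
    finished = False
    while not finished and log.moves < max_moves:
        command = model(log.transcript)  # the model sees the full history
        log.transcript.append(("assistant", command))
        text, log.score, finished = game.step(command)
        log.transcript.append(("user", text))
        log.moves += 1
    return log


log = run_episode(ToyGame(), scripted_model)
print(log.moves, log.score, len(log.transcript))  # 3 5 7
```

Passing the whole transcript on every turn mirrors the stated design choice that each model has access to the full conversation history, so any failure to learn from earlier attempts cannot be attributed to missing context.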

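Experimental Results

Research Questions

RQ1: Can state-of-the-art LLMs achieve meaningful completion of a long, narrative text adventure game without game-specific training?
RQ2: Does providing detailed game knowledge or enabling a "thinking" mode improve performance or reflect metacognitive planning?
RQ3: Do LLMs show adaptive reasoning, memory use, and learning across attempts in Zork I?
RQ4: What qualitative patterns bear on whether LLMs genuinely understand the game during play?
RQ5: Are the performance gaps due to metacognitive limitations rather than simple factual recall?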

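Key Findings

All tested models completed less than 10% of the game on average; the best performer, Claude Opus 4.5, reached only about 75 of 350 points (about 21%).
Neither providing detailed game instructions nor enabling extended thinking improved performance for any model.
Across providers, there was almost no difference between the basic and advanced prompts, or between thinking-enabled and thinking-disabled configurations.
Qualitative analysis shows repetitive, unreflective actions and a failure to learn from the conversation history, suggesting limited metacognition and planning.
ChatGPT tends to fall into long sequences of similar commands, suggesting little self-correction or strategy adaptation.
Only Claude explicitly said "I give up", and did so fairly often, when stuck in an unrecoverable loop.
Overall, the results point to substantial limitations in LLM metacognition and text-based problem solving.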
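
As a rough illustration of the numbers above, the sketch below converts a score into a completion percentage and measures the longest run of identical consecutive commands in a log. The repetition measure is an assumption added here to make "long sequences of similar commands" concrete; it is not a metric reported in the paper, and the example command list is invented.

```python
from itertools import groupby

MAX_SCORE = 350  # maximum score stated in the review


def completion_pct(score, max_score=MAX_SCORE):
    """Score expressed as a percentage of the maximum."""
    return 100.0 * score / max_score


def longest_repeat_run(commands):
    """Length of the longest run of identical consecutive commands."""
    normalized = [c.strip().lower() for c in commands]
    return max((len(list(g)) for _, g in groupby(normalized)), default=0)


print(round(completion_pct(75), 1))  # 21.4

# Invented command log, for illustration only.
example = ["open door", "open door", "Open Door", "go north", "open door"]
print(longest_repeat_run(example))  # 3
```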


This review was generated by AI and reviewed by a human editor.