QUICK REVIEW

[Paper Review] Can Large Language Models Play Text Games Well? Current State-of-the-Art and Open Questions

Chen Feng Tsai, Xiaochen Zhou|arXiv (Cornell University)|Apr 6, 2023

Topic Modeling | Cited by 9

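One-Sentence Summary

This paper examines how well ChatGPT and similar large language models play text-based games such as Zork. Playing on its own, ChatGPT scores far below trained agents and shows no sign of a learned world model or of goal inference; human-guided prompting lifts its score to roughly the level of the best trained agents, but the underlying weaknesses remain.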

ABSTRACT

Large language models (LLMs) such as ChatGPT and GPT-4 have recently demonstrated their remarkable abilities of communicating with human users. In this technical report, we take an initiative to investigate their capacities of playing text games, in which a player has to understand the environment and respond to situations by having dialogues with the game world. Our experiments show that ChatGPT performs competitively compared to all the existing systems but still exhibits a low level of intelligence. Precisely, ChatGPT can not construct the world model by playing the game or even reading the game manual; it may fail to leverage the world knowledge that it already has; it cannot infer the goal of each step as the game progresses. Our results open up new research questions at the intersection of artificial intelligence, machine learning, and natural language processing.

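Research Motivation and Objectives

Motivate the use of games as microcosms for evaluating AI capabilities such as world modeling and goal inference.
Assess whether ChatGPT can learn a world model by reading the game walkthrough and interacting with the text game.
Evaluate ChatGPT's navigation, SLAM-like reasoning, and goal inference in Zork.
Benchmark ChatGPT against state-of-the-art text-game agents under several prompting protocols.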

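Proposed Method

Use the Jericho implementation of Zork I and have ChatGPT play the game with a human in the loop.
Provide ChatGPT with the current game state, ask it for a feasible action, and feed the chosen action back to the game.
Test world-model learning by presenting the correct walkthrough and then asking questions about locations and destinations.
Evaluate SLAM-like navigation by having ChatGPT predict destinations from pairs of locations.
Evaluate goal inference by asking for the next high-level goal given the game progress and observations so far.
Compare ChatGPT with trained text-game agents (DRRN, KG-A2C, RC-DQN) and the untrained NAIL baseline.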
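The interaction protocol described above (show the model the current game state, take the action it proposes, feed that action back to the game) can be sketched as a simple loop. This is a minimal illustration, not the authors' code: `ToyTextGame` is a hypothetical two-room stand-in for the real Zork I environment, `choose_action` is a placeholder for the call to the language model, and the `include_prev_action` flag is an assumed way of modelling the "+ prev action" variant.

```python
from typing import Callable, Tuple


class ToyTextGame:
    """Hypothetical two-room stand-in for a text game such as Zork I."""

    def reset(self) -> str:
        self.location = "West of House"
        return "You are standing west of a white house. There is a mailbox here."

    def step(self, action: str) -> Tuple[str, int, bool]:
        """Apply an action and return (observation, reward, done)."""
        if self.location == "West of House" and action == "open mailbox":
            return "You open the mailbox and find a leaflet.", 5, True
        if self.location == "West of House" and action == "north":
            self.location = "North of House"
            return "You are north of the house.", 0, False
        if self.location == "North of House" and action == "south":
            self.location = "West of House"
            return "You are west of the house.", 0, False
        return "Nothing happens.", 0, False


def play_episode(
    game: ToyTextGame,
    choose_action: Callable[[str], str],  # placeholder for the LLM call
    max_steps: int = 10,
    include_prev_action: bool = False,    # assumed model of "+ prev action"
) -> int:
    """Run one episode and return the accumulated score."""
    observation = game.reset()
    prev_action = None
    total = 0
    for _ in range(max_steps):
        prompt = observation
        if include_prev_action and prev_action is not None:
            prompt = f"Previous action: {prev_action}\n{observation}"
        action = choose_action(prompt)
        observation, reward, done = game.step(action)
        total += reward
        prev_action = action
        if done:
            break
    return total
```

A fixed policy such as `lambda prompt: "open mailbox"` ends this toy episode after one step; in a real experiment the callable would wrap a chat-model request and the toy class would be replaced by the actual game environment.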

Figure 1: We drew this map after reading the first 70 steps of the correct walkthrough.

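Experimental Results

Research Questions

RQ1: Can an LLM such as ChatGPT build or infer a usable world model in a text game?
RQ2: Can ChatGPT infer high-level goals that guide its actions, rather than only proposing single-step actions?
RQ3: How does ChatGPT perform, relative to trained agents, on navigation and mapping tasks (SLAM-like reasoning) that require understanding the structure of the environment?
RQ4: Under standard prompting, how does ChatGPT's Zork score compare with those of state-of-the-art text-game agents?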

Main Findings

Model	Score
ChatGPT	10.0
ChatGPT (+ prev action)	15.0
ChatGPT with intervention	35.0
ChatGPT with intervention (+ prev action)	40.0
NAIL	10.3
DRRN	32.6
KG-A2C	38.8
RC-DQN	34.0
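To make the comparison in the table concrete, the short sketch below computes each non-trained system's gap to the best trained agent. The scores are copied from the table; the row labelled "+ prev action" under "ChatGPT with intervention" is read here as intervention plus the previous action, which is an interpretation rather than something the table states. The function name is illustrative.

```python
# Scores copied from the table above.
scores = {
    "ChatGPT": 10.0,
    "ChatGPT (+ prev action)": 15.0,
    "ChatGPT with intervention": 35.0,
    "ChatGPT with intervention (+ prev action)": 40.0,  # assumed reading of the "+ prev action" row
    "NAIL": 10.3,
    "DRRN": 32.6,
    "KG-A2C": 38.8,
    "RC-DQN": 34.0,
}
trained_agents = ("DRRN", "KG-A2C", "RC-DQN")


def gap_to_best_trained(scores, trained):
    """Return the best trained agent and every other system's score minus its score."""
    best = max(trained, key=lambda name: scores[name])
    gaps = {
        name: round(score - scores[best], 1)
        for name, score in scores.items()
        if name not in trained
    }
    return best, gaps


best, gaps = gap_to_best_trained(scores, trained_agents)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} vs {best}")
```

On these numbers the unassisted ChatGPT runs and NAIL sit more than 20 points below the best trained agent, while the two intervention runs land within a few points of it.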

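On destination questions, ChatGPT's one-step accuracy is 55.4% overall (75.0% on the seen part of the map, 29.1% on the unseen part); its two-step accuracy is 31.3% (50.0% seen, 10.0% unseen); accuracy across both question types is 42.5%.
On SLAM questions, ChatGPT's one-step accuracy is 57.7%, its two-step accuracy is 22.8%, and its overall accuracy is 39.4%; it does better on the seen part of the map than on the unseen part.
ChatGPT tends to report low-level actions as if they were high-level goals; only 17 of 70 steps yield a meaningful goal inference.
With no game-specific training, ChatGPT scores 10.0 on Zork, and 15.0 when the previous action is included in the prompt; with human intervention it reaches 35.0, and 40.0 with intervention plus the previous action.
Compared with the agents trained on Zork (DRRN 32.6, RC-DQN 34.0, KG-A2C 38.8), unassisted ChatGPT (10.0 to 15.0) scores far lower and is roughly level with the untrained NAIL baseline (10.3); only with human intervention does it reach 35.0 to 40.0, which is comparable to the best trained score in the table.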
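The accuracy figures above are broken down by condition (seen versus unseen parts of the map). The sketch below shows one plausible way such a breakdown is tallied, and why an overall figure need not equal the simple average of the per-condition figures when the conditions contain different numbers of questions. The records are invented for illustration and are not the paper's data; whether the paper pools questions this way is an assumption.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, is_correct) pairs. Returns accuracy per group."""
    tallies = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for group, is_correct in records:
        tallies[group][0] += int(is_correct)
        tallies[group][1] += 1
    return {group: correct / total for group, (correct, total) in tallies.items()}


def pooled_accuracy(records):
    """Accuracy over all questions, ignoring the grouping."""
    records = list(records)
    return sum(int(is_correct) for _, is_correct in records) / len(records)


# Invented example: four questions on seen locations, two on unseen ones.
records = [
    ("seen", True), ("seen", True), ("seen", False), ("seen", True),
    ("unseen", False), ("unseen", True),
]
per_group = accuracy_by_group(records)
overall = pooled_accuracy(records)
print(per_group, round(overall, 3))
```

In this toy example the pooled accuracy (4 of 6) is higher than the midpoint of the two group accuracies because the seen group contains more questions.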

This review was generated by AI and reviewed by human editors.