QUICK REVIEW

[Paper Digest] Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach

Weiyu Ma, Qirui Mi|arXiv (Cornell University)|Dec 19, 2023

Topic Modeling|Cited by 10

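One-Sentence Summary

This paper introduces TextStarCraft II, a text-based StarCraft II environment, and the Chain of Summarization (CoS) method, which together let large language models play StarCraft II. It shows that GPT3.5-Turbo-16k and GPT-4, playing as Protoss, can defeat the built-in Zerg AI at the Harder (Lv5) difficulty, and it releases open-source code and demos.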

ABSTRACT

StarCraft II is a challenging benchmark for AI agents due to the necessity of both precise micro-level operations and strategic macro awareness. Previous works, such as AlphaStar and SCC, achieve impressive performance on StarCraft II but still exhibit deficiencies in long-term strategic planning and strategy interpretability. Emerging large language model (LLM) agents, such as Voyager and MetaGPT, present immense potential for solving intricate tasks. Motivated by this, we aim to validate the capabilities of LLMs on StarCraft II, a highly complex RTS game. To take full advantage of LLMs' reasoning abilities, we first develop a textual StarCraft II environment, called TextStarCraft II, with which LLM agents can interact. Second, we propose a Chain of Summarization method, including single-frame summarization for processing raw observations and multi-frame summarization for analyzing game information, providing command recommendations, and generating strategic decisions. Our experiment consists of two parts: first, an evaluation by human experts, which includes assessing the LLMs' mastery of StarCraft II knowledge and the performance of LLM agents in the game; second, the in-game performance of LLM agents, encompassing aspects like win rate and the impact of Chain of Summarization. Experiment results demonstrate that: 1. LLMs possess the relevant knowledge and complex planning abilities needed to address StarCraft II scenarios; 2. Human experts consider the performance of LLM agents to be close to that of an average player who has played StarCraft II for eight years; 3. LLM agents are capable of defeating the built-in AI at the Harder (Lv5) difficulty level. We have open-sourced the code and released demo videos of LLM agents playing StarCraft II.

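Research Motivation and Objectives

Develop a text-based StarCraft II environment (TextStarCraft II) that large language models can interact with.
Propose Chain of Summarization to enable multi-frame reasoning and fast decision-making.
Assess LLMs' knowledge and strategic ability through both expert evaluation and in-game evaluation.
Compare multiple LLMs and prompts, analyzing the impact of prompts and running ablation studies.
Release open-source code and demo videos for reproducibility.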

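Proposed Method

Build a text-based StarCraft II interface (TextStarCraft II) on top of python-sc2, with adapters that convert observations to text and text to actions.
Formulate TextStarCraft II as a multi-agent partially observable Markov decision process (POMDP) with textual observations and macro actions.
Propose Chain of Summarization (CoS), comprising single-frame summarization, multi-frame summarization, and action extraction that feeds an action queue.
Use a two-stage interaction frequency (a K-frame cycle) to balance LLM inference cost against game speed.
Evaluate different LLMs under CoS (GPT3.5-Turbo, GPT3.5-Turbo-16k, GPT-4, Finetune-ChatGlm2 6b, Finetune-Llama2 7b), combining human and in-game evaluation.
Run ablation studies to assess the impact of CoS and of prompts.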
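
The three CoS stages and the K-frame cycle can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `KNOWN_ACTIONS` vocabulary, the `<ACTION>` bracket convention, the observation fields, and the `fake_llm` stub are all assumptions made for illustration, and no python-sc2 calls are used. The sketch only shows the control flow: summarize every frame cheaply, call the LLM once per K frames, and keep only recognized commands for the action queue.

```python
import re
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple

# Hypothetical macro-action vocabulary; the real environment defines its own.
KNOWN_ACTIONS = {"TRAIN PROBE", "BUILD PYLON", "BUILD GATEWAY", "TRAIN ZEALOT"}


def summarize_frame(obs: Dict[str, int]) -> str:
    """Single-frame summarization: turn one raw observation into a short text line."""
    return ", ".join(f"{key}={value}" for key, value in sorted(obs.items()))


def summarize_frames(frames: List[str], llm: Callable[[str], str]) -> str:
    """Multi-frame summarization: ask the LLM to reason over K frame summaries at once."""
    prompt = "Recent game states:\n" + "\n".join(frames) + "\nDecisions:"
    return llm(prompt)


def extract_actions(llm_output: str) -> List[str]:
    """Action extraction: keep only bracketed commands found in the known vocabulary."""
    candidates = re.findall(r"<([A-Z ]+)>", llm_output)
    return [c.strip() for c in candidates if c.strip() in KNOWN_ACTIONS]


def run_cos(
    observations: Iterable[Dict[str, int]],
    llm: Callable[[str], str],
    k: int = 3,
) -> Tuple[List[str], int]:
    """Call the LLM once every k frames; return the action queue and the call count."""
    window: Deque[str] = deque(maxlen=k)
    action_queue: List[str] = []
    llm_calls = 0
    for obs in observations:
        window.append(summarize_frame(obs))
        if len(window) == k:
            decision_text = summarize_frames(list(window), llm)
            action_queue.extend(extract_actions(decision_text))
            llm_calls += 1
            window.clear()
    return action_queue, llm_calls


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "We are low on supply. <BUILD PYLON> then <TRAIN PROBE> and <DANCE>"


# Six made-up frames with k=3 trigger exactly two LLM calls.
observations = [{"minerals": 50 * i, "supply_used": 12 + i} for i in range(6)]
queue, calls = run_cos(observations, fake_llm, k=3)
print(calls, queue)
```

In this sketch, raising `k` lowers the number of LLM calls per game at the price of less frequent decisions, which is the trade-off the K-frame cycle is meant to manage. Commands outside the vocabulary (here `<DANCE>`) are dropped instead of being sent to the game.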

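Experimental Results

Research Questions

RQ1: Do LLMs equipped with CoS have the domain knowledge and planning ability needed to handle StarCraft II scenarios?
RQ2: Does Chain of Summarization enable effective long-term strategic decision-making and faster interaction?
RQ3: How does in-game performance differ across LLMs (and prompts) in TextStarCraft II?
RQ4: How do prompts affect strategy formation and the win rate against the built-in AI?
RQ5: Do human expert evaluations agree with the in-game performance of the LLM agents?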

Key Findings

Method	API calls	Time cost (hours)
With Chain of Summarization	700	7
Without Chain of Summarization	7,000	70
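
A quick arithmetic check on the table, written as a short Python sketch. The only inputs are the four figures reported above. The per-call time is a derived quantity, not one the digest states, and it assumes the time cost is spread evenly over the API calls.

```python
# Figures copied from the table above.
with_cos = {"api_calls": 700, "hours": 7}
without_cos = {"api_calls": 7_000, "hours": 70}

# How many times more calls and hours the non-CoS setting needs.
call_ratio = without_cos["api_calls"] / with_cos["api_calls"]
time_ratio = without_cos["hours"] / with_cos["hours"]

# Average wall-clock seconds per API call in each setting
# (assumes time is spread evenly over the calls).
sec_per_call_with = with_cos["hours"] * 3600 / with_cos["api_calls"]
sec_per_call_without = without_cos["hours"] * 3600 / without_cos["api_calls"]

print(call_ratio, time_ratio, sec_per_call_with, sec_per_call_without)
```

Both ratios come out to 10, and the average time per call is the same in both rows. Under the even-spread assumption, the reported time saving tracks the reduction in the number of calls, not faster individual calls.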

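LLMs possess the relevant StarCraft II knowledge and complex planning abilities in TextStarCraft II scenarios.
Human experts rate the LLM agents' performance as close to that of an average player with eight years of StarCraft II experience.
Playing as Protoss, LLM agents can defeat the built-in AI at the Harder (Lv5) difficulty.
Compared with the non-CoS setting, CoS speeds up decision-making and reduces the number of API calls (700 vs. 7,000 calls and 7 vs. 70 hours in the table above).
GPT3.5-Turbo-16k and GPT-4 show advanced reasoning under CoS and can defeat the Harder AI; GPT-4 did not get past Very Hard; the fine-tuned models (ChatGLM2 6b, Llama2 7b) struggle to fight effectively.
Prompt design significantly affects strategy formation and win rate, with Prompt2 outperforming Prompt1.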


This digest was generated by AI and reviewed by a human editor.