QUICK REVIEW

[Paper Review] Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing

Maciej Świechowski, Adam Żychowski | arXiv (Cornell University) | Feb 22, 2026


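One-Sentence Summary

This paper evaluates four large language models on forward-simulation and rule-based reasoning tasks within the General Game Playing framework, analyzes how game structure and obfuscation affect performance, and identifies common reasoning errors and limitations. The results show clear progress in single-step reasoning, but performance drops markedly over longer horizons and on more complex tasks.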

ABSTRACT

This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B and GPT-OSS 120B) on a suite of forward-simulation tasks (including next / multistep state formulation, and legal action generation) across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps). Detailed case-based analysis of the LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, or syntactic errors. Overall, the paper reports clear progress in formal reasoning capabilities of contemporary models.

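Research Motivation and Objectives

Adapt the General Game Playing (GGP) framework into a benchmark for evaluating symbolic reasoning in LLMs.
Evaluate four contemporary LLMs on forward-simulation and rule-interpretation tasks across a diverse set of games.
Analyze how game structure, combinatorial complexity, and semantic grounding relate to reasoning accuracy.
Study the effect of semantic obfuscation on LLM reasoning, to separate symbolic ability from linguistic prior knowledge.
Identify common LLM errors in a logic-based reasoning benchmark.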

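Proposed Method

Define four tasks over GDL-based game descriptions: next-state generation, legal action generation, multistep state generation, and multistep action-state generation.
Evaluate four models (Gemini 2.5 Pro, Gemini 2.5 Flash, Llama 3.3 70B, GPT-OSS 120B) on 35 GGP games.
Represent outputs as sets of GDL facts and score them with the Jaccard index and strict success (%S).
Compare the original, semantically meaningful descriptions against obfuscated variants (placeholder words, dictionary words, random strings) to assess semantic grounding.
Analyze results across horizons and perform a qualitative error analysis to identify common failure modes (hallucinated rules, redundant facts, constraint violations).
Run a correlation analysis between structural game features (such as rule depth and the number of next rules) and model performance.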
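
The two scoring metrics named in the list above can be sketched as follows. This is a minimal illustration, assuming each predicted and reference output has already been parsed into a set of GDL fact strings. The function names `jaccard_index` and `strict_success`, the empty-set convention, and the reading of %S as the fraction of exact set matches are assumptions made for this sketch, not the authors' implementation.

```python
from typing import Iterable, Set, Tuple


def jaccard_index(predicted: Set[str], reference: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of facts."""
    union = predicted | reference
    if not union:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0
    return len(predicted & reference) / len(union)


def strict_success(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Fraction of cases whose predicted set equals the reference exactly."""
    cases = list(pairs)
    if not cases:
        return 0.0
    return sum(pred == ref for pred, ref in cases) / len(cases)


# Hypothetical tic-tac-toe-style facts, made up for illustration only.
reference = {"(cell 1 1 x)", "(cell 1 2 b)", "(control oplayer)"}
predicted = {"(cell 1 1 x)", "(cell 1 2 b)", "(control xplayer)"}

# Two shared facts out of four distinct facts overall.
ji = jaccard_index(predicted, reference)

# One mismatching case and one exact match.
s = strict_success([(predicted, reference), (reference, reference)])
```

In this toy case a single wrong fact still earns partial credit under the Jaccard index but fails the strict check, which is one way %S can fall while the Jaccard index stays high.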

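Experimental Results

Research Questions

RQ1: Can LLMs reliably simulate the symbolic dynamics of formally specified games without an external solver?
RQ2: How do problem structure and combinatorial depth affect the accuracy of LLM reasoning on single-step and multistep tasks?
RQ3: How do semantic grounding and surface linguistic cues affect LLM reasoning on GGP tasks?
RQ4: What are the common failure modes of LLMs in logic-based forward simulation, and how are they amplified as the number of steps grows?
RQ5: Do larger or more specialized models (such as the Gemini variants) remain robust under both obfuscated and non-obfuscated descriptions?

Key Findings

Gemini 2.5 Pro generally achieves the highest average performance across tasks, though its performance drops markedly as the evaluation horizon lengthens.
Next-state generation is the easiest task: strong models average a Jaccard index above 0.8, and Gemini 2.5 Pro produces a fully correct successor state in more than 95% of cases on average, with %S ≥ 0.85 in 34 of 35 games.
Legal action generation is harder: strict success (%S) often drops even when the Jaccard index remains high, indicating difficulty producing a complete and exact set of legal actions.
Multistep state generation is considerably harder: Gemini 2.5 Pro averages JI ≈ 0.865 and %S ≈ 0.734, while the other models degrade more sharply, indicating that errors propagate across steps.
Multistep action-state generation is the most challenging task: even Gemini 2.5 Pro degrades (average JI ≈ 0.808, %S ≈ 0.653), and Llama 3.3 70B performs poorly.
Obfuscation lowers performance across models. Models are comparatively more robust to random-string obfuscation than to the dictionary-word or placeholder variants, suggesting that symbolic reasoning is somewhat robust to surface-level linguistic changes.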
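
To make the obfuscation variants discussed above more concrete, here is a rough sketch of how game-specific symbols in a GDL description might be renamed consistently while keywords and variables are left intact. The keyword list, the `obfuscate` function, the `sym0`, `sym1`, ... placeholder scheme, and the decision to leave numbers untouched are all assumptions made for illustration; the review does not describe the authors' actual procedure, and the dictionary-word variant and GDL comments are not handled here.

```python
import random
import re
import string

# GDL keywords whose meaning must be preserved (assumed list for this sketch).
GDL_KEYWORDS = {"role", "init", "true", "does", "next", "legal", "goal",
                "terminal", "distinct", "or", "not", "base", "input"}

# A token is any run of characters that is neither whitespace nor a parenthesis.
TOKEN = re.compile(r"[^\s()]+")


def obfuscate(gdl: str, mode: str = "placeholder", seed: int = 0) -> str:
    """Rename game-specific symbols consistently; keep the rule structure."""
    if mode not in ("placeholder", "random"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    mapping = {}

    def fresh_name() -> str:
        if mode == "placeholder":
            return f"sym{len(mapping)}"
        # Random mode: draw 8 lowercase letters, retrying on a collision.
        while True:
            name = "".join(rng.choices(string.ascii_lowercase, k=8))
            if name not in mapping.values():
                return name

    def rename(match: re.Match) -> str:
        tok = match.group(0)
        # Keywords, variables (?x), the implication sign, and numbers stay as-is.
        if (tok.lower() in GDL_KEYWORDS or tok.startswith("?")
                or tok == "<=" or tok.isdigit()):
            return tok
        if tok not in mapping:
            mapping[tok] = fresh_name()
        return mapping[tok]

    return TOKEN.sub(rename, gdl)


# Hypothetical tic-tac-toe-style rule, made up for illustration only.
rule = "(<= (next (cell ?x ?y x)) (does xplayer (mark ?x ?y)))"
placeholder_rule = obfuscate(rule, mode="placeholder")
random_rule = obfuscate(rule, mode="random", seed=1)
```

Because the renaming is consistent within one description, the logical structure of the rules is unchanged; only the names that might carry linguistic hints or trigger recall of a known game are removed.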

This review was generated by AI and checked by a human editor.