QUICK REVIEW

[Paper Review] Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners

Xiaojuan Tang, Zilong Zheng | arXiv (Cornell University) | May 24, 2023

Topic Modeling | Cited by 15

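One-Sentence Summary

This paper decouples semantics from in-context reasoning in large language models. It shows that semantic representations drive most of the reasoning, with semantic settings outperforming purely symbolic cues on many tasks, and it analyzes memorization and reasoning across deduction, induction, and abduction.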

ABSTRACT

The emergent few-shot reasoning capabilities of Large Language Models (LLMs) have excited the natural language and machine learning community over recent years. Despite numerous successful applications, the underlying mechanism of such in-context capabilities remains unclear. In this work, we hypothesize that the learned semantics of language tokens do most of the heavy lifting during the reasoning process. Different from humans' symbolic reasoning process, the semantic representations of LLMs could create strong connections among tokens, thus composing a superficial logical chain. To test our hypothesis, we decouple semantics from the language reasoning process and evaluate three kinds of reasoning abilities, i.e., deduction, induction and abduction. Our findings reveal that semantics play a vital role in LLMs' in-context reasoning: LLMs perform significantly better when semantics are consistent with commonsense but struggle to solve symbolic or counter-commonsense reasoning tasks by leveraging in-context new knowledge. The surprising observations question whether modern LLMs have mastered the inductive, deductive and abductive reasoning abilities as in human intelligence, and motivate research on unveiling the magic existing within the black-box LLMs. On the whole, our analysis provides a novel perspective on the role of semantics in developing and evaluating language models' reasoning abilities. Code is available at https://github.com/XiaojuanTang/ICSR.

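Research Motivation and Goals

Investigate whether LLMs can reason in context without semantics, by decoupling semantic content from the reasoning prompts.
Evaluate three types of reasoning (deduction, induction, and abduction) in a controlled symbolic setting.
Evaluate LLMs' memorization and knowledge-updating behavior when given semantic versus symbolic information.
Examine how commonsense knowledge and representation (natural language vs. logical language) affect LLMs' in-context reasoning.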

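Proposed Method

Introduce a synthetic Symbolic Tree dataset with closed-world, noise-free symbolic reasoning, plus a ProofWriter subset under the open-world assumption, to test semantics-free reasoning.
Decouple semantics by replacing predicates with symbolic labels (e.g., r1, r2) and entities with IDs; compare against the Semantics setting (natural-language predicates).
Evaluate ChatGPT, GPT-4, and LLaMA-7B on deductive, inductive, and abductive tasks; use a logic-based baseline, and Neo4j for the memorization comparison.
Fine-tune LLaMA-7B on the memorization task and compare internal memory with an external knowledge base.
Analyze the effect of conditions: removing rules/facts, introducing counter-commonsense labels, and testing ProofWriter OWA (open-world assumption) tasks to study the influence of semantics.
Explore how context length and representation (natural language vs. logical language) affect reasoning performance; evaluate zero-shot versus Chain-of-Thought (CoT) prompting, and the use of internal versus external knowledge.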
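The semantic decoupling step described in the list above can be illustrated with a minimal sketch. It assumes facts are stored as (head, predicate, tail) triples; the example names (Alice, parentOf, and so on), the `e1`-style entity ID format, and the `decouple` helper are hypothetical and are not taken from the paper's released code.

```python
from typing import Dict, List, Tuple

# Hypothetical fact format: (head entity, predicate, tail entity)
Triple = Tuple[str, str, str]


def decouple(
    triples: List[Triple],
) -> Tuple[List[Triple], Dict[str, str], Dict[str, str]]:
    """Replace predicates with r1, r2, ... and entities with e1, e2, ...

    The mapping is consistent: the same name always gets the same symbol,
    so the logical structure is preserved while the word meaning is removed.
    """
    pred_map: Dict[str, str] = {}
    ent_map: Dict[str, str] = {}
    symbolic: List[Triple] = []
    for head, pred, tail in triples:
        # Assign entity IDs in order of first appearance (assumed scheme).
        for ent in (head, tail):
            if ent not in ent_map:
                ent_map[ent] = f"e{len(ent_map) + 1}"
        # Assign predicate labels in order of first appearance.
        if pred not in pred_map:
            pred_map[pred] = f"r{len(pred_map) + 1}"
        symbolic.append((ent_map[head], pred_map[pred], ent_map[tail]))
    return symbolic, pred_map, ent_map


# Illustrative facts only; these are not from the Symbolic Tree dataset.
facts = [
    ("Alice", "parentOf", "Bob"),
    ("Bob", "parentOf", "Carol"),
    ("Alice", "sisterOf", "Dana"),
]
symbolic, pred_map, ent_map = decouple(facts)
for triple in symbolic:
    print(triple)
```

Keeping the two mappings makes it possible to present the same reasoning problem in both the Semantics and the Symbols setting, so that any performance difference can be attributed to the surface form rather than to the underlying logic.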

Figure 1: Task Definitions. Memorization : retrieving the predicted fact from in-context knowledge. Deductive : predicting the correctness of the predicted fact given rules and facts. Inductive : generating a rule based on multiple facts with similar patterns. Abductive : explaining the predicted fa

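Experimental Results

Research Questions

RQ1: When reasoning in context over symbolic tasks, do LLMs rely on semantics or on internalized prior knowledge?
RQ2: How do deductive, inductive, and abductive tasks differ in their sensitivity to semantic versus symbolic representations?
RQ3: With semantics decoupled, what roles do commonsense knowledge and memorization play in LLM reasoning?
RQ4: How do representation and prompting strategy (zero-shot vs. CoT) affect in-context reasoning performance?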

Main Findings

Category	Model	Baseline	Deductive	Inductive	Abductive
Symbols	ChatGPT	Zero-Shot	52.6	6.10	1.50
Symbols	ChatGPT	Zero-Shot-CoT	55.7	7.86	4.90
Symbols	ChatGPT	Few-Shot-CoT	54.8	-	18.2
Symbols	ChatGPT	Zero-Plus-Few-Shot-CoT	55.7	-	-
Symbols	GPT-4	Zero-Shot	68.8	9.28	25.0
Symbols	GPT-4	Zero-Shot-CoT	71.1	8.93	31.2
Symbols	GPT-4	Few-Shot-CoT	67.6	-	44.2
Symbols	GPT-4	Zero-Plus-Few-Shot-CoT	67.2	-	-
Semantics	ChatGPT	Zero-Shot	66.1	36.4	2.94
Semantics	ChatGPT	Zero-Shot-CoT	65.5	32.2	3.40
Semantics	ChatGPT	Few-Shot-CoT	67.1	-	21.8
Semantics	ChatGPT	Zero-Plus-Few-Shot-CoT	67.2	-	-
Semantics	GPT-4	Zero-Shot	79.2	52.5	27.3
Semantics	GPT-4	Zero-Shot-CoT	86.2	53.9	33.4
Semantics	GPT-4	Few-Shot-CoT	91.1	-	69.2
Random	-	-	-	-	-
Logic-based	-	-	57.1	100	100

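On Symbolic Tree, the Semantics setting substantially improves deductive and inductive reasoning over the purely symbolic setting (for example, GPT-4 zero-shot rises from 68.8 to 79.2 on deduction and from 9.28 to 52.5 on induction).
GPT-4 generally outperforms ChatGPT. Across evaluation settings semantics help performance, whereas symbolic representations sometimes reduce the gains on induction.
On some symbolic logic tasks the logic-based baseline remains stronger, indicating that current LLMs have not fully captured symbolic reasoning.
New facts are memorized faster with semantic representations, but forgetting can be higher because of stronger inter-factor correlations.
Zero-shot-CoT brings limited gains in the Semantics setting, and on some semantics-decoupled tasks it can even perform worse than zero-shot.
In reasoning tasks, models generally rely on internal knowledge more than on rules supplied externally in context.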
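To make the size of the semantic advantage concrete, the following sketch computes the Semantics-minus-Symbols gap per task from the zero-shot rows of the table above. The scores are copied from the table; the `semantic_gap` helper and the dictionary layout are illustrative choices, not part of the paper.

```python
# Zero-shot scores copied from the table above:
# (category, model) -> (deductive, inductive, abductive)
ZERO_SHOT = {
    ("Symbols", "ChatGPT"): (52.6, 6.10, 1.50),
    ("Semantics", "ChatGPT"): (66.1, 36.4, 2.94),
    ("Symbols", "GPT-4"): (68.8, 9.28, 25.0),
    ("Semantics", "GPT-4"): (79.2, 52.5, 27.3),
}
TASKS = ("deductive", "inductive", "abductive")


def semantic_gap(model: str) -> dict:
    """Return Semantics minus Symbols for each task, in percentage points."""
    semantics = ZERO_SHOT[("Semantics", model)]
    symbols = ZERO_SHOT[("Symbols", model)]
    return {
        task: round(sem - sym, 2)
        for task, sem, sym in zip(TASKS, semantics, symbols)
    }


for model in ("ChatGPT", "GPT-4"):
    print(model, semantic_gap(model))
```

Under these numbers the gap is largest for induction in both models, which is consistent with the first finding in the list above.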

Figure 2: Decoupling semantics from the ProofWriter task. In the original ProofWriter task, entities are represented by their names (left). However, in our decoupled setting, we replace the entity names with unique entity IDs (right).


This review was generated by AI and reviewed by a human editor.