QUICK REVIEW

[Paper Review] Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks

Fan Huang|arXiv (Cornell University)|Mar 21, 2026

Advanced Graph Neural Networks | Cited by 0

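One-Sentence Summary

The paper introduces Network-of-Thought (NoT), a graph-based reasoning framework guided by a controller with self-generated heuristics, and compares it against Chain-of-Thought (CoT) and Tree-of-Thought (ToT) on several benchmarks using GPT-4o-mini and open-source models. NoT performs well on multi-hop and multi-source reasoning tasks, while CoT remains the strongest method on sequential tasks with GPT-4o-mini.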

ABSTRACT

Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14-18 percentage point gap on HotpotQA).

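Motivation and Objectives

Formalize a taxonomy of reasoning topologies (chain, tree, network) and their tradeoffs.
Propose NoT, which expands nodes in a typed graph-based reasoning framework under a heuristic-guided controller.
Evaluate self-generated heuristics for the controller weights and their effect on performance.
Assess topology effectiveness, efficiency, and the influence of the evaluation method on results across benchmarks.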

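Proposed Method

Represents reasoning as a directed graph with typed nodes (fact, subgoal, constraint, conclusion) and typed edges (depends-on, supports, derives, contradicts).
Introduces a controller that scores unresolved nodes using weights on uncertainty, dependency, and conflict; the weights may be generated by the LLM itself (self-generated heuristics).
Uses a three-stage NoT pipeline: graph initialization, iterative graph expansion through LLM calls, and answer extraction, with answers evaluated by an LLM-based semantic judge.
Compares NoT with CoT and ToT on GSM8K, Game of 24, HotpotQA, and ProofWriter, using GPT-4o-mini, Llama-3.3-70B-Instruct, and Qwen2.5-72B-Instruct.
Measures accuracy under two evaluation schemes (string match and LLM-as-Judge) and analyzes how the evaluation method affects the ranking of topologies.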
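
The typed graph and controller described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class names, the `uncertainty` field, the weighted-sum scoring formula, and the `expand` callable (a stand-in for an LLM call) are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    FACT = "fact"
    SUBGOAL = "subgoal"
    CONSTRAINT = "constraint"
    CONCLUSION = "conclusion"


class EdgeType(Enum):
    DEPENDS_ON = "depends_on"
    SUPPORTS = "supports"
    DERIVES = "derives"
    CONTRADICTS = "contradicts"


@dataclass
class Node:
    node_id: str
    node_type: NodeType
    uncertainty: float = 0.5  # hypothetical field: 0 = certain, 1 = unknown
    resolved: bool = False


@dataclass
class ThoughtGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, EdgeType) triples

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, edge_type):
        # Convention assumed here: (src, dst, DEPENDS_ON) means src depends on dst.
        self.edges.append((src, dst, edge_type))


def score(graph, node, w_unc, w_dep, w_con):
    """Assumed scoring rule: weighted sum of uncertainty, dependents, conflicts."""
    dependents = sum(1 for s, d, t in graph.edges
                     if d == node.node_id and t == EdgeType.DEPENDS_ON)
    conflicts = sum(1 for s, d, t in graph.edges
                    if node.node_id in (s, d) and t == EdgeType.CONTRADICTS)
    return w_unc * node.uncertainty + w_dep * dependents + w_con * conflicts


def select_next(graph, weights):
    """Return the highest-scoring unresolved node, or None if all are resolved."""
    unresolved = [n for n in graph.nodes.values() if not n.resolved]
    if not unresolved:
        return None
    return max(unresolved, key=lambda n: score(graph, n, **weights))


def expand_loop(graph, weights, expand, max_steps=10):
    """Repeatedly pick a node, let `expand` update the graph, mark it resolved."""
    order = []
    for _ in range(max_steps):
        node = select_next(graph, weights)
        if node is None:
            break
        expand(graph, node)  # stand-in for an LLM call that may add nodes/edges
        node.resolved = True
        order.append(node.node_id)
    return order


def build_demo_graph():
    g = ThoughtGraph()
    g.add_node(Node("f1", NodeType.FACT, uncertainty=0.0, resolved=True))
    g.add_node(Node("s1", NodeType.SUBGOAL, uncertainty=0.4))
    g.add_node(Node("s2", NodeType.SUBGOAL, uncertainty=0.9))
    g.add_node(Node("c1", NodeType.CONCLUSION, uncertainty=0.5))
    g.add_edge("c1", "s1", EdgeType.DEPENDS_ON)
    g.add_edge("s2", "s1", EdgeType.DEPENDS_ON)
    g.add_edge("s1", "f1", EdgeType.DEPENDS_ON)
    return g
```

With uncertainty-only weights (`w_unc=1, w_dep=0, w_con=0`) the demo graph expands `s2` first because it is the least certain node; raising `w_dep` instead favors `s1`, which two other nodes depend on. In the self-generated setting, the weight dictionary would come from the LLM rather than being fixed by hand.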

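Experimental Results

Research Questions

RQ1: When is a network reasoning topology needed instead of a chain or tree structure?
RQ2: Can self-generated heuristics improve network reasoning?
RQ3: What is the tradeoff between computational cost and accuracy across reasoning topologies?

Key Findings

CoT remains the best method on sequential tasks with GPT-4o-mini (89.5% on GSM8K).
NoT outperforms ToT on multi-hop reasoning (HotpotQA with LLM-as-Judge: 91.0% for NoT vs. 88.0% for ToT).
With the 72B open-source models, NoT achieves the highest GSM8K accuracy (91.5%), and Qwen2.5-72B-Instruct reaches the best multi-hop QA score on HotpotQA (91.7%).
Self-generated controller heuristics outperform fixed and random strategies on logical reasoning (ProofWriter: 54.0% vs. 51.3% for fixed weights; uncertainty-only weighting alone reaches 57.0%).
The evaluation method significantly affects method rankings: string match underestimates all methods on open-ended QA, with the largest gap for NoT (a 14-18 percentage point gap on HotpotQA, consistent across all three models).
NoT graphs are advantageous for reusing reasoning and integrating information from multiple sources, and they reach accuracy competitive with ToT at a moderate token cost.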
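
The gap between string-match and judge-based scoring can be illustrated with a small sketch. The normalization below is a common convention for QA exact match and is an assumption here, not necessarily the paper's exact procedure. `toy_judge` is a hypothetical containment check used only so the example runs without a model; the paper's judge is an LLM.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """String-match scoring: correct only if the normalized strings are equal."""
    return normalize(prediction) == normalize(gold)


def toy_judge(prediction, gold):
    """Hypothetical stand-in for an LLM judge: accepts any prediction that
    contains the gold answer after normalization."""
    return normalize(gold) in normalize(prediction)


def accuracy(predictions, golds, is_correct):
    """Fraction of predictions accepted by the given correctness function."""
    hits = sum(1 for p, g in zip(predictions, golds) if is_correct(p, g))
    return hits / len(golds)


# Made-up examples: verbose but correct answers fail exact match.
predictions = ["Paris", "The answer is Barack Obama.", "It was released in 1999"]
golds = ["Paris", "Barack Obama", "1999"]

em_acc = accuracy(predictions, golds, exact_match)
judge_acc = accuracy(predictions, golds, toy_judge)
```

On these made-up examples, exact match accepts only the first answer while the lenient check accepts all three. Methods that produce longer, explanatory answers are penalized more by exact match, which is one plausible reading of why the reported gap is largest for NoT.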


This review was generated by AI and reviewed by human editors.