QUICK REVIEW

[Paper Review] Recursive Concept Evolution for Compositional Reasoning in Large Language Models

Sarim Chaudhry|arXiv (Cornell University)|Feb 17, 2026
Explainable Artificial Intelligence (XAI)|Cited by 0
One-sentence summary

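This work proposes the Recursive Concept Evolution (RCE) framework, which lets a pretrained language model dynamically spawn, evaluate, and merge low-rank concept subspaces at inference time to improve compositional reasoning. It is evaluated on the ARC-AGI-2, GPQA, MATH, BBH, and HLE benchmarks.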

ABSTRACT

Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model's latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.

Motivation and objectives

  • Identify architectural limitations of fixed latent geometries in pretrained LLMs for compositional reasoning.
  • Propose RCE to dynamically create, evaluate, and compose low-rank concept subspaces during inference.
  • Demonstrate that RCE yields robust improvements across diverse compositional benchmarks while maintaining stability and efficiency.

Proposed method

  • Maintain a frozen base model while injecting learnable, low-rank concept subspaces into the residual stream at a single decoder layer.
  • Spawn candidate subspaces when a failure signal based on predictive entropy and top-token margin triggers it.
  • Select concepts using a minimum description length (MDL) criterion balancing loss reduction against model complexity.
  • Merge concepts that exhibit synergistic co-activation via truncated SVD to form higher-order abstractions.
  • Prune and crystallize concepts to control library growth and preserve stability, with optional KL-constrained updates to limit distribution drift.
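Two of the steps above, the spawn trigger and MDL-based selection, can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the summary does not say how entropy and margin are combined or what form the MDL objective takes, so the thresholds, the AND rule, the `cost_per_param` weight, and the function names `failure_signal` and `select_by_mdl` are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def failure_signal(logits, entropy_threshold=1.0, margin_threshold=0.1):
    """Return True when the next-token distribution looks uncertain.

    Assumption: the signal fires when predictive entropy is high AND the
    gap between the two most likely tokens is small. Both thresholds are
    hypothetical values chosen for illustration. Needs at least two logits.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)  # in nats
    top1, top2 = sorted(probs, reverse=True)[:2]
    margin = top1 - top2
    return entropy > entropy_threshold and margin < margin_threshold


def select_by_mdl(base_loss, candidates, cost_per_param=0.01):
    """Pick the candidate subspace with the lowest description length.

    candidates: list of (name, loss_with_candidate, num_params).
    Assumed form: description length = loss + cost_per_param * num_params.
    Returns None when no candidate beats the base model, which pays no
    parameter cost.
    """
    best_name, best_length = None, base_loss
    for name, loss, num_params in candidates:
        length = loss + cost_per_param * num_params
        if length < best_length:
            best_name, best_length = name, length
    return best_name
```

Under these assumptions, uniform logits over four tokens fire the signal (maximal entropy, zero margin), while a sharply peaked distribution does not. In the selection step, a candidate that lowers the loss only slightly but adds many parameters is rejected, which is the trade-off the MDL bullet describes.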

Experimental results

Research questions

  • RQ1: Can a frozen pretrained language model benefit from inference-time expansion of its representational subspace via low-rank concept injection?
  • RQ2: Do MDL-based selection, orthogonality regularization, and synergy-driven merging yield robust, scalable improvements in compositional reasoning?
  • RQ3: Is RCE effective across multiple model scales and diverse compositional benchmarks (ARC-AGI-2, GPQA, MATH, BBH, HLE) while maintaining efficiency?

Key findings

  • RCE provides consistent accuracy gains over strong baselines across five compositional benchmarks and multiple model scales.
  • On Mistral-7B, RCE reaches 28.0% on ARC-AGI-2, 47.4% on MATH, 70.5% on BBH, 41.4% on GPQA, and 18.7% on HLE, versus 19.7%, 41.3%, 64.8%, 34.2%, and 13.8% respectively for DisCO.
  • Relative to DisCO, that is a gain of 8.3 points on ARC-AGI-2 and 7.2 points on GPQA, with smaller gains of 6.1 points on MATH, 5.7 on BBH, and 4.9 on HLE; improvements are reported to remain significant at 14B scale.
  • Under distribution shifts, RCE preserves over 91% of its standard ARC-AGI-2 accuracy, compared with 68–80% retention for the baselines.
  • The concept library stabilizes (e.g., 47 concepts for Mistral-7B), with a hierarchical structure of primitives, merged abstractions, and domain-general tools.
  • RCE is more compute-efficient: on MATH it uses about 1.04x the base model's FLOPs (roughly 4% overhead), whereas token-heavy methods incur 16–25x compute multipliers.
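The per-benchmark gains and the overhead figure follow directly from the numbers listed above. The short check below recomputes them; the dictionary and variable names are chosen for this example, and the scores are copied from the Mistral-7B results in this summary.

```python
# Mistral-7B accuracies (%) as listed in the findings above.
rce = {"ARC-AGI-2": 28.0, "MATH": 47.4, "BBH": 70.5, "GPQA": 41.4, "HLE": 18.7}
disco = {"ARC-AGI-2": 19.7, "MATH": 41.3, "BBH": 64.8, "GPQA": 34.2, "HLE": 13.8}

# Absolute gain of RCE over DisCO, in percentage points.
gains = {task: round(rce[task] - disco[task], 1) for task in rce}

# A 1.04x FLOPs multiplier corresponds to a 4% compute overhead.
flops_multiplier = 1.04
overhead_pct = round((flops_multiplier - 1.0) * 100, 1)

for task, gain in gains.items():
    print(f"{task}: +{gain} points")
print(f"Compute overhead: {overhead_pct}%")
```

The ARC-AGI-2 and GPQA differences come out to 8.3 and 7.2 points, matching the stated gains over DisCO.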


This review was generated by AI and checked by a human editor.