QUICK REVIEW

[Paper Review] Recursive Concept Evolution for Compositional Reasoning in Large Language Models

Sarim Chaudhry | arXiv (Cornell University) | Feb 17, 2026

Explainable Artificial Intelligence (XAI) | Citations: 0

One-sentence summary

This paper introduces Recursive Concept Evolution (RCE), a framework that enables pretrained language models to dynamically spawn, evaluate, and merge low-rank concept subspaces during inference to improve compositional reasoning, evaluated on ARC-AGI-2, GPQA, MATH, BBH, and HLE.

ABSTRACT

Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model's latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.

Motivation and objectives

Identify architectural limitations of fixed latent geometries in pretrained LLMs for compositional reasoning.
Propose RCE to dynamically create, evaluate, and compose low-rank concept subspaces during inference.
Demonstrate that RCE yields robust improvements across diverse compositional benchmarks while maintaining stability and efficiency.

Proposed method

Maintain a frozen base model while injecting learnable, low-rank concept subspaces into the residual stream at a single decoder layer.
Spawn candidate subspaces when a failure signal based on predictive entropy and top-token margin triggers it.
Select concepts using a minimum description length (MDL) criterion balancing loss reduction against model complexity.
Merge concepts that exhibit synergistic co-activation via truncated SVD to form higher-order abstractions.
Prune and crystallize concepts to control library growth and preserve stability, with optional KL-constrained updates to limit distribution drift.
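
The spawn trigger and the MDL selection step listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: this review does not give the exact formulas, so the thresholds, the rule that combines entropy with margin, the per-parameter penalty, and every function name here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def failure_signal(logits, entropy_threshold=1.0, margin_threshold=0.2):
    """Return True when the next-token distribution looks uncertain.

    Assumption: the signal fires when predictive entropy (in nats) is high
    AND the gap between the two most likely tokens is small. Both thresholds
    are hypothetical values chosen for illustration.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    top1, top2 = sorted(probs, reverse=True)[:2]
    margin = top1 - top2
    return entropy > entropy_threshold and margin < margin_threshold


def mdl_score(loss_without, loss_with, num_params, penalty_per_param=0.01):
    """Hypothetical MDL-style score: loss reduction minus a complexity cost."""
    return (loss_without - loss_with) - penalty_per_param * num_params


def select_concepts(candidates, loss_without):
    """Keep candidates whose MDL-style score is positive, best first.

    Each candidate is a dict with hypothetical keys
    'name', 'loss_with', and 'num_params'.
    """
    scored = [
        (mdl_score(loss_without, c["loss_with"], c["num_params"]), c["name"])
        for c in candidates
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0.0]


if __name__ == "__main__":
    # A nearly flat distribution fires the trigger; a peaked one does not.
    print(failure_signal([2.0, 1.9, 1.8, 1.7]))
    print(failure_signal([10.0, 0.0, 0.0, 0.0]))

    candidates = [
        {"name": "a", "loss_with": 1.0, "num_params": 50},
        {"name": "b", "loss_with": 1.9, "num_params": 50},
    ]
    print(select_concepts(candidates, loss_without=2.0))
```

In this sketch a candidate survives only if its loss reduction outweighs its parameter cost, which is the trade-off the MDL criterion is described as balancing. Merging via truncated SVD and KL-constrained consolidation are left out because they depend on details not stated here.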

Experimental results

Research questions

RQ1: Can a frozen pretrained language model benefit from inference-time expansion of its representational subspace via low-rank concept injection?
RQ2: Do MDL-based selection, orthogonality regularization, and synergy-driven merging yield robust, scalable improvements in compositional reasoning?
RQ3: Is RCE effective across multiple model scales and diverse compositional benchmarks (ARC-AGI-2, GPQA, MATH, BBH, HLE) while maintaining efficiency?

Key findings

Method	Model	ARC-AGI-2	MATH	BBH	GPQA	HLE
Base	Mistral-7B	12.4	28.6	51.3	24.1	8.2
CoT	Mistral-7B	15.1	34.2	57.8	28.5	10.1
SC (n=16)	Mistral-7B	16.8	37.1	60.2	30.3	11.4
ToT	Mistral-7B	17.3	36.8	59.5	31.0	11.9
GRPO	Mistral-7B	18.2	38.9	62.1	32.4	12.6
DisCO	Mistral-7B	19.7	41.3	64.8	34.2	13.8
RCE	Mistral-7B	28.0	47.4	70.5	41.4	18.7
Base	Llama-3-8B	14.1	31.4	54.7	27.3	9.6
RCE	Llama-3-8B	29.8	49.1	72.3	43.1	20.2
Base	Qwen-14B	19.3	42.8	63.5	36.7	14.3
RCE	Qwen-14B	33.6	54.2	76.1	48.9	23.1

RCE provides consistent accuracy gains over strong baselines across five compositional benchmarks and multiple model scales.
On Mistral-7B, RCE achieves ARC-AGI-2: 28.0%, MATH: 47.4%, BBH: 70.5%, GPQA: 41.4%, HLE: 18.7% (vs. 19.7, 41.3, 64.8, 34.2, 13.8 for DisCO).
Over DisCO on Mistral-7B, RCE gains 8.3 points on ARC-AGI-2, 7.2 on GPQA, 6.1 on MATH, 5.7 on BBH, and 4.9 on HLE; at 14B scale (Qwen-14B), it gains 8.8 to 14.3 points over the base model across the five benchmarks.
Under distribution shifts, RCE preserves over 91% of standard ARC-AGI-2 accuracy, outperforming baselines (68–80% retention).
The concept library stabilizes (e.g., 47 concepts for Mistral-7B), with a hierarchical structure of primitives, merged abstractions, and domain-general tools.
RCE is compute-efficient: about 1.04x base FLOPs (4% overhead) on MATH, whereas token-heavy methods incur 16–25x compute multipliers.
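
The point gains quoted in these findings can be re-derived from the results table. The short script below does that arithmetic; the accuracy values are copied from the table in this review, while the dictionary layout and the helper name `point_gain` are illustrative choices, not anything taken from the paper.

```python
# Accuracy (%) per benchmark, copied from the results table in this review.
BENCHMARKS = ["ARC-AGI-2", "MATH", "BBH", "GPQA", "HLE"]

RESULTS = {
    ("Base", "Mistral-7B"): [12.4, 28.6, 51.3, 24.1, 8.2],
    ("DisCO", "Mistral-7B"): [19.7, 41.3, 64.8, 34.2, 13.8],
    ("RCE", "Mistral-7B"): [28.0, 47.4, 70.5, 41.4, 18.7],
    ("Base", "Llama-3-8B"): [14.1, 31.4, 54.7, 27.3, 9.6],
    ("RCE", "Llama-3-8B"): [29.8, 49.1, 72.3, 43.1, 20.2],
    ("Base", "Qwen-14B"): [19.3, 42.8, 63.5, 36.7, 14.3],
    ("RCE", "Qwen-14B"): [33.6, 54.2, 76.1, 48.9, 23.1],
}


def point_gain(method, baseline, model):
    """Absolute gain in accuracy points of `method` over `baseline`."""
    scores = RESULTS[(method, model)]
    reference = RESULTS[(baseline, model)]
    # Round to one decimal to match the precision of the table.
    return {
        name: round(s - r, 1)
        for name, s, r in zip(BENCHMARKS, scores, reference)
    }


if __name__ == "__main__":
    print("RCE vs DisCO (Mistral-7B):", point_gain("RCE", "DisCO", "Mistral-7B"))
    for model in ("Mistral-7B", "Llama-3-8B", "Qwen-14B"):
        print(f"RCE vs Base ({model}):", point_gain("RCE", "Base", model))
```

Running it reproduces the 8.3-point ARC-AGI-2 and 7.2-point GPQA margins over DisCO cited above, and shows the gains over the base model for each of the three models.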

This review was generated by AI and checked by a human editor.