QUICK REVIEW

[Paper Review] MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra|arXiv (Cornell University)|Feb 2, 2026

Scientific Computing and Data Management|Cited by 0

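One-Sentence Summary

MARS introduces budget-aware Monte Carlo Tree Search, modular decomposition, and comparative reflective memory to automate AI research, achieving state-of-the-art performance among open-source frameworks on MLE-Bench and transferring lessons effectively across search branches.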

ABSTRACT

Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

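Motivation and Objectives

Identify the challenges specific to automating AI research, notably expensive evaluation and opaque credit attribution, and propose solutions to them.
Propose a framework (MARS) that balances performance against compute cost through budget-aware planning.
Promote repository-level modular construction to manage architectural complexity and improve testability.
Introduce comparative reflective memory to distill causal insights and guide long-horizon exploration.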

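Proposed Method

Implement budget-aware Monte Carlo Tree Search (MCTS) that balances performance against execution cost through an efficiency-oriented reward (Eq. 4).
Adopt a modular Design-Decompose-Implement pipeline that replaces monolithic scripts with independent, testable modules and supports diff-based atomic updates.
Introduce comparative reflective memory, which extracts high-signal lessons by comparing the current solution with the best known one; the lessons cover both structured debugging and problem solving.
Recast long-horizon AI research as a repository-level problem by combining task preparation with three pillars: resource-aware planning, modular decomposition, and reflective memory.
Evaluate on MLE-Bench under a 24-hour wall-clock budget, reporting Above Median, Bronze, Silver, Gold, and Any Medal metrics, and validate each component through ablations.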
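
The budget-aware selection step in the method list above can be sketched as follows. This review does not reproduce Eq. 4, so `efficiency_reward` below is an assumed stand-in (validation score minus a penalty proportional to the share of the time budget consumed), not the paper's formula. The `Node` class, the `penalty` weight, and the exploration constant `c` are also hypothetical; only the UCT selection rule is standard MCTS.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate solution in the search tree (hypothetical structure)."""
    name: str
    score: float        # validation metric, higher is better
    runtime_s: float    # measured execution time in seconds
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)


def efficiency_reward(score: float, runtime_s: float,
                      budget_s: float, penalty: float = 0.1) -> float:
    """Assumed reward: score minus a penalty scaled by budget share used.

    This is NOT the paper's Eq. 4; it only illustrates trading
    performance against execution cost.
    """
    budget_share = min(runtime_s / budget_s, 1.0)
    return score - penalty * budget_share


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT value; unvisited children are explored first."""
    if child.visits == 0:
        return math.inf
    mean_reward = child.total_reward / child.visits
    return mean_reward + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT value."""
    return max(parent.children, key=lambda ch: uct(ch, parent.visits))


def backpropagate(path: list[Node], reward: float) -> None:
    """Add the reward to every node on the path from root to leaf."""
    for node in path:
        node.visits += 1
        node.total_reward += reward


if __name__ == "__main__":
    budget = 3600.0  # assumed per-run time budget in seconds
    root = Node("root", 0.0, 0.0)
    fast = Node("fast", 0.80, 600.0)
    slow = Node("slow", 0.80, 3000.0)
    root.children = [fast, slow]
    for leaf in (fast, slow):
        backpropagate([root, leaf],
                      efficiency_reward(leaf.score, leaf.runtime_s, budget))
    print(select_child(root).name)
```

With equal validation scores, the candidate that finishes in 10 minutes earns a higher reward than the one that takes 50 minutes, so once both have been visited the selection step favors the faster one.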

Figure 1 : The “Aha!” moment of MARS on the challenging iMet-2020-FGVC7 task. The visualization tracks validation performance gains triggered by specific strategic lessons. While existing methods fail to reach medal-level performance, MARS progressively refines its strategy – evolving from a lightwe

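Experimental Results

Research Questions

RQ1: How does budget-aware planning improve efficiency on long-horizon AI research tasks?
RQ2: Does modular decomposition improve solution quality and maintainability for complex research pipelines?
RQ3: Does comparative reflective memory enable effective credit assignment and accelerate long-horizon learning?
RQ4: How does lesson learning affect cross-branch transfer and exploration dynamics?
RQ5: Under realistic constraints, how does MARS perform on MLE-Bench relative to open-source baselines?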

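Key Findings

MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings.
With increased compute, MARS+ achieves the highest Above Median, Gold Medal, and Any Medal rates, surpassing the leading baselines.
Ablation studies show that modular decomposition and lesson learning significantly improve performance.
Budget-aware MCTS yields a higher valid-solution rate and prefers faster candidates when performance is similar, which speeds up discovery.
Lessons show high utilization and cross-branch transfer, indicating that insights generalize effectively across search paths.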
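
The comparative step and the cross-branch transfer statistic can be illustrated with a small sketch. In MARS an LLM analyzes the differences between solutions; the code below swaps in a plain dictionary diff to show the idea and then computes the share of lesson uses that came from another branch, the kind of quantity behind the 63% figure in the abstract. All class, field, and function names here are hypothetical, and the data in the usage section is made up for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    """A candidate solution summarized by its settings and score (assumed)."""
    branch: str
    config: dict
    score: float


@dataclass
class Lesson:
    """An insight distilled on one search branch (assumed structure)."""
    lesson_id: str
    source_branch: str
    text: str


@dataclass
class LessonUse:
    """A record that a lesson was applied while working on some branch."""
    lesson: Lesson
    used_in_branch: str


def compare_solutions(current: Solution, best: Solution):
    """Return settings that differ as {key: (best, current)} and the score gap.

    A stand-in for the LLM-based comparison: it only diffs config values.
    """
    keys = sorted(set(current.config) | set(best.config))
    changed = {
        k: (best.config.get(k), current.config.get(k))
        for k in keys
        if best.config.get(k) != current.config.get(k)
    }
    return changed, current.score - best.score


def cross_branch_share(uses: list[LessonUse]) -> float:
    """Fraction of lesson uses where the lesson came from another branch."""
    if not uses:
        return 0.0
    crossed = sum(1 for u in uses if u.lesson.source_branch != u.used_in_branch)
    return crossed / len(uses)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    best = Solution("A", {"lr": 1e-3, "backbone": "small"}, 0.60)
    current = Solution("B", {"lr": 1e-3, "backbone": "large"}, 0.65)
    changed, gap = compare_solutions(current, best)
    lesson = Lesson("L1", "B", f"Changing {sorted(changed)} moved the score by {gap:+.2f}")
    uses = [LessonUse(lesson, "B"), LessonUse(lesson, "A"), LessonUse(lesson, "C")]
    print(changed, round(cross_branch_share(uses), 3))
```

Recording the source branch on each lesson is what makes the transfer statistic computable: a use counts as cross-branch only when the branch applying the lesson differs from the branch that produced it.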

Figure 2 : Overview of the MARS Framework. MARS reformulates long-horizon coding as a search for an optimal software repository. (1) Task Preparation: The agent grounds the abstract problem (Instruction, Environment, Objective) tuple by exploratory analysis of the given dataset and metadata. (2) The


This review was generated by AI and reviewed by a human editor.