QUICK REVIEW

[Paper Review] MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra|arXiv (Cornell University)|Feb 2, 2026
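Scientific Computing and Data Management | Cited by 0
One-Sentence Summary

MARS combines budget-aware Monte Carlo Tree Search, modular decomposition, and comparative reflective memory to automate AI research. It achieves state-of-the-art performance among open-source frameworks on MLE-Bench and transfers lessons effectively across search branches.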

ABSTRACT

Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

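Motivation and Objectives

  • Motivate and address the challenges unique to automating AI research, namely computationally expensive evaluation and opaque performance attribution (credit assignment).
  • Propose a framework (MARS) that balances performance against computational cost through budget-aware planning.
  • Promote repository-level modular construction to manage architectural complexity and improve testability.
  • Introduce comparative reflective memory to distill causal insights and guide long-horizon exploration.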

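Proposed Method

  • Implement budget-aware Monte Carlo Tree Search (MCTS) that balances performance against execution cost through an efficiency-oriented reward (Eq. 4).
  • Adopt a modular Design-Decompose-Implement pipeline that replaces monolithic scripts with independent, testable modules and supports atomic, diff-based updates.
  • Introduce comparative reflective memory, which extracts high-signal lessons by comparing the current solution against the best known solution. The lessons include structured debugging lessons and solution lessons.
  • Combine a task preparation stage with the three pillars (resource-aware planning, modular decomposition, and reflective memory) to recast long-horizon AI research as a repository-level problem.
  • Evaluate on MLE-Bench under a 24-hour wall-clock budget, reporting the Above Median, Bronze, Silver, Gold, and Any Medal metrics, and validate each component through ablations.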
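To make the idea of an efficiency-oriented reward more concrete, here is a minimal Python sketch. The paper's Eq. 4 is not reproduced in this review, so the reward formula below (score minus a penalty proportional to the fraction of the budget consumed) is an assumption chosen for illustration, not the authors' definition. The names `Node`, `efficiency_reward`, `uct_select`, and the `penalty` parameter are all hypothetical. The selection rule is the standard UCT formula.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One candidate solution in the search tree (hypothetical structure)."""
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def efficiency_reward(score: float, cost_seconds: float,
                      budget_seconds: float, penalty: float = 0.1) -> float:
    """Assumed reward, NOT the paper's Eq. 4.

    Takes a validation score (higher is better) and subtracts a penalty
    proportional to the fraction of the time budget the run consumed.
    """
    fraction_used = min(cost_seconds / budget_seconds, 1.0)
    return score - penalty * fraction_used


def uct_select(parent: Node, c: float = 1.4) -> Node:
    """Standard UCT: mean reward plus an exploration bonus."""
    def uct(child: Node) -> float:
        if child.visits == 0:
            return math.inf  # always try unvisited children first
        mean = child.total_reward / child.visits
        bonus = c * math.sqrt(math.log(parent.visits) / child.visits)
        return mean + bonus
    return max(parent.children, key=uct)


# Two candidates with the same score but different run times:
budget = 24 * 3600  # 24-hour budget in seconds
fast = efficiency_reward(0.80, cost_seconds=600, budget_seconds=budget)
slow = efficiency_reward(0.80, cost_seconds=12 * 3600, budget_seconds=budget)
print(fast > slow)  # the faster candidate earns the larger reward
```

Under this assumed reward, two candidates with equal scores are ranked by run time, which is one simple way to obtain the "prefer the faster candidate" behavior the method aims for. Other formulations (for example, a ratio of gain to cost) would serve the same purpose.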
Figure 1 : The “Aha!” moment of MARS on the challenging iMet-2020-FGVC7 task. The visualization tracks validation performance gains triggered by specific strategic lessons. While existing methods fail to reach medal-level performance, MARS progressively refines its strategy – evolving from a lightwe

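Experimental Results

Research Questions

  • RQ1: How does budget-aware planning improve efficiency on long-horizon AI research tasks?
  • RQ2: Does modular decomposition improve solution quality and maintainability for complex research pipelines?
  • RQ3: Does comparative reflective memory enable effective credit assignment and accelerate long-horizon learning?
  • RQ4: How does lesson learning affect cross-branch transfer and exploration dynamics?
  • RQ5: How does MARS perform against open-source baselines on MLE-Bench under realistic constraints?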

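Key Findings

  • MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings.
  • With increased compute, MARS+ achieves the highest Above Median, Gold Medal, and Any Medal rates, surpassing the leading baselines.
  • Ablation studies show that modular decomposition and lesson learning each improve performance significantly.
  • Budget-aware MCTS yields a higher rate of valid solutions and prefers the faster candidate when performance is similar, which speeds up discovery.
  • Lessons show high utilization and transfer across branches: according to the abstract, 63% of all utilized lessons originate from cross-branch transfer, indicating that insights generalize across search paths.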
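As a rough illustration of how a cross-branch transfer rate could be measured, the sketch below counts the share of lesson uses whose lesson was learned in a different search branch from the one where it was applied. This review does not state how the paper computes its figure, so the `LessonUse` record, its fields, and the definition of the rate are assumptions, and the data is a toy example rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LessonUse:
    """One recorded use of a lesson (hypothetical log record)."""
    lesson_id: str
    origin_branch: str   # branch where the lesson was first distilled
    used_in_branch: str  # branch where the lesson was later applied


def cross_branch_rate(uses: Sequence[LessonUse]) -> float:
    """Fraction of lesson uses that crossed from one branch to another."""
    if not uses:
        return 0.0
    crossed = sum(1 for u in uses if u.origin_branch != u.used_in_branch)
    return crossed / len(uses)


# Toy log: three of the four uses apply a lesson outside its origin branch.
toy_log = [
    LessonUse("L1", origin_branch="A", used_in_branch="B"),
    LessonUse("L1", origin_branch="A", used_in_branch="A"),
    LessonUse("L2", origin_branch="B", used_in_branch="C"),
    LessonUse("L3", origin_branch="C", used_in_branch="A"),
]
print(cross_branch_rate(toy_log))  # 0.75
```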
Figure 2 : Overview of the MARS Framework. MARS reformulates long-horizon coding as a search for an optimal software repository. (1) Task Preparation: The agent grounds the abstract problem (Instruction, Environment, Objective) tuple by exploratory analysis of the given dataset and metadata. (2) The


This review was generated by AI and reviewed by a human editor.