QUICK REVIEW

[Paper Review] Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Yifei Zhang, Xu Yang | arXiv (Cornell University) | Mar 2, 2026

Machine Learning and Data Classification | Cited by 0

ONE-SENTENCE SUMMARY

Summary: The paper presents Gome, a gradient-based MLE agent that uses structured reasoning, momentum-like memory, and multi-trace collaboration to outperform tree-search baselines on MLE-Bench, with its advantage growing as the underlying model's reasoning capability increases.

ABSTRACT

LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce Gome, an MLE agent that operationalizes gradient-based optimization. Gome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, Gome achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm. We release our codebase and GPT-5 traces at https://github.com/microsoft/RD-Agent.

RESEARCH MOTIVATION AND OBJECTIVES

Motivate a shift from score-based tree search to gradient-like optimization for MLE agents as LLM reasoning improves.
Map LLM-driven reasoning to structured optimization components (gradient signals, momentum, distributed updates).
Evaluate Gome against strong baselines on MLE-Bench under a closed-world protocol to isolate architectural effects.
Analyze how Gome scales with model capability across multiple LLM tiers.
Provide ablations and a scalable design that enables reproducibility (code and GPT-5 traces).

PROPOSED METHOD

Propose Gome, a chain-based optimization framework where each step updates the pipeline along LLM-generated improvement directions.
Use a four-stage loop per iteration: execution feedback, hierarchical validation, success memory update, and structured reasoning to generate the next hypothesis.
Introduce a shared success memory (momentum) and a multi-trace (distributed) optimization setup to coordinate improvements.
Treat reasoning as gradient signals rather than scalar score ranking, with candidate hypotheses scored across multiple dimensions and sampled from top-k.
Force diversification across N parallel traces and use cross-trace memory and LLM-based selection to guide hypotheses.
Evaluate under a closed-world protocol on MLE-Bench with a 12-hour budget on a single V100 GPU across frontier models (GPT-5, o3, DeepSeek variants).
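
The four-stage loop, shared success memory, and top-k hypothesis sampling described above can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the Gome implementation: every name here (SuccessMemory, propose, sample_top_k, run, and the lr/depth config) is hypothetical, the LLM's structured reasoning is replaced by a random stub, the "pipeline" is a two-parameter config with a made-up score, and a simple accept-if-better check stands in for hierarchical validation.

```python
import random
from dataclasses import dataclass, field

# All names below are hypothetical stand-ins, not identifiers from the Gome codebase.

@dataclass
class SuccessMemory:
    """Shared log of edits that improved a trace (the 'momentum' analogue)."""
    entries: list = field(default_factory=list)

    def add(self, change, gain):
        self.entries.append((change, gain))

def execute(config):
    """Toy validation score, higher is better; peaks at lr=0.1, depth=6."""
    return -10 * abs(config["lr"] - 0.1) - abs(config["depth"] - 6)

def is_valid(config):
    """Stand-in for hierarchical validation: a cheap sanity check only."""
    return 0 < config["lr"] <= 1 and 1 <= config["depth"] <= 12

def propose(config, memory, rng, n=6):
    """Stub for LLM structured reasoning: returns (change, scores) candidates."""
    out = []
    for _ in range(n):
        if rng.random() < 0.5:
            change = {"lr": round(config["lr"] * rng.choice([0.5, 2.0]), 4)}
        else:
            change = {"depth": config["depth"] + rng.choice([-1, 1])}
        key = next(iter(change))
        # Edits touching a parameter that helped before get extra support.
        reuse = sum(1 for past, _ in memory.entries if key in past)
        scores = {"impact": rng.random(), "feasibility": rng.random(),
                  "memory_support": min(reuse, 3) / 3}
        out.append((change, scores))
    return out

def sample_top_k(candidates, rng, k=3):
    """Rank candidates by summed multi-dimensional score, sample one of the top k."""
    ranked = sorted(candidates, key=lambda c: sum(c[1].values()), reverse=True)
    return rng.choice(ranked[:k])[0]

def initial_configs(n_traces):
    """Forced diversification: every trace starts from a different config."""
    return [{"lr": 0.8 / (i + 1), "depth": 2 + (4 * i) % 10} for i in range(n_traces)]

def run(n_traces=2, steps=20, seed=0):
    rng = random.Random(seed)
    memory = SuccessMemory()  # shared across all traces
    traces = initial_configs(n_traces)
    best = [execute(c) for c in traces]
    pending = [sample_top_k(propose(c, memory, rng), rng) for c in traces]
    for _ in range(steps):
        for i in range(n_traces):
            candidate = {**traces[i], **pending[i]}
            # 1. Execution feedback (invalid configs are skipped, score stays None).
            score = execute(candidate) if is_valid(candidate) else None
            # 2. Validation: keep the edit only if it is valid and improves the trace.
            accepted = score is not None and score > best[i]
            # 3. Success memory update, visible to every trace.
            if accepted:
                memory.add(pending[i], score - best[i])
                traces[i], best[i] = candidate, score
            # 4. Reasoning: generate the next hypothesis from the current state.
            pending[i] = sample_top_k(propose(traces[i], memory, rng), rng)
    return traces, best, memory
```

Calling run(n_traces=3, steps=15, seed=1) returns the final config of each trace, its best score, and the shared memory. The point of the sketch is the control flow: each trace moves along a chain of accepted edits rather than expanding a tree of ranked candidates, and successful edits from one trace bias the proposals of the others.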

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can gradient-based optimization (as instantiated by Gome) surpass tree-search-based MLE agents as LLM reasoning capability grows?
RQ2: How do structured reasoning, momentum-like memory, and multi-trace collaboration contribute to performance and robustness in MLE tasks?
RQ3: What is the scaling behavior of gradient-based MLE agents across model tiers, from efficiency-tier to frontier reasoning models?
RQ4: What is the impact of a closed-world protocol on evaluating MLE agents, and how does Gome perform under such constraints?

KEY FINDINGS

Gome achieves a state-of-the-art any-medal rate (35.1%) on MLE-Bench under a 12-hour budget with GPT-5.
Gome attains a 96.0% valid submission rate and a 16.4% gold-medal rate on MLE-Bench (GPT-5).
Gradient-based optimization gains widen with stronger reasoning models, surpassing tree search by up to 7.1 percentage points on frontier models.
Ablations show structured reasoning, success memory, and multi-trace optimization each meaningfully improve medal rates; removing any component degrades performance.
Scaling analysis reveals a clear phase transition: gradient signals outperform tree search as reasoning capability increases (Efficiency < Advanced < Frontier).
48-hour and half-budget ablations indicate that stronger models benefit more from increased compute, suggesting potential for further gains with more time or higher reasoning quality.
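
The crossover reported above follows the optimization analogy from the abstract, which a toy experiment can make concrete. The sketch below is an illustration only and reproduces none of the paper's experiments: the quadratic objective, the function names, and the `reliability` parameter (the probability that a directed step points the right way, standing in for reasoning quality) are all assumptions introduced here. Unlike Gome, the directed updater has no validation step, so wrong directions are never rejected.

```python
import random

def f(x):
    """Toy objective: squared distance from the optimum at the origin."""
    return sum(v * v for v in x)

def random_search(x, steps, rng, sigma=0.1):
    """Gradient-free: propose a random move, keep it only if the score improves."""
    best = f(x)
    for _ in range(steps):
        cand = [v + rng.gauss(0.0, sigma) for v in x]
        score = f(cand)
        if score < best:
            x, best = cand, score
    return best

def directed_updates(x, steps, rng, reliability, lr=0.1):
    """Directed: follow the gradient, but each step points the wrong way
    with probability 1 - reliability (a stand-in for unreliable reasoning)."""
    for _ in range(steps):
        sign = 1.0 if rng.random() < reliability else -1.0
        x = [v - lr * sign * 2.0 * v for v in x]  # the gradient of f is 2x
    return f(x)

if __name__ == "__main__":
    start = [1.0] * 10
    print("random search:", random_search(start, 50, random.Random(0)))
    for r in (0.0, 0.5, 0.8, 1.0):
        print("reliability", r, "->", directed_updates(start, 50, random.Random(0), r))
```

With these settings a correct step scales every coordinate by 0.8 and a wrong one by 1.2, so at reliability 1.0 the objective shrinks by a factor of 0.64 per step, while at reliability 0.0 it grows by 1.44 per step. Random search can never get worse than its starting point, which is why it is the safer choice when directions are unreliable; in this toy the expected log-change of the directed updater turns negative once reliability exceeds about 0.45.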


This review was generated by AI and reviewed by human editors.