QUICK REVIEW

[Paper Review] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu, Yuzhu Cai | arXiv (Cornell University) | Jan 15, 2026

Multimodal Machine Learning Applications | Cited by: 0

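One-Sentence Summary

ML-Master 2.0 introduces Hierarchical Cognitive Caching to enable ultra-long-horizon autonomous ML engineering, reaching a 56.44% medal rate on MLE-Bench and performing strongly across task difficulty levels.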

ABSTRACT

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

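Research Motivation and Objectives

Reframe ultra-long-horizon autonomy as cognitive accumulation, turning transient experience into reusable knowledge and wisdom.
Propose Hierarchical Cognitive Caching (HCC), which combines multiple cache tiers with context migration to manage long-horizon context.
Show that decoupling short-term execution from long-term strategy improves stability and performance on MLE tasks.
Validate HCC empirically on OpenAI's MLE-Bench, demonstrating a state-of-the-art medal rate and robustness across task complexity.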

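Proposed Method

Introduces a three-tier Hierarchical Cognitive Cache (L1: Evolving Experience, L2: Refined Knowledge, L3: Prior Wisdom) that separates transient context from stable cognition.
Implements context migration through three operations: context prefetching for initialization, context hits for retrieval, and context promotion for consolidation.
Treats MLE as ultra-long-horizon planning, using phased hierarchical plans with parallel exploration directions.
Uses phase-level promotion to compress trajectories into refined knowledge, and task-level promotion to distill transferable wisdom.
Evaluates on MLE-Bench under a fixed 24-hour budget, with medal rate (bronze/silver/gold) as the primary metric.
Achieves cross-task transfer through a warm-started prior-wisdom cache (L3) and task-agnostic descriptor embeddings.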
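
To make the three tiers and the three migration operations above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class name `HierarchicalCache`, its method names, the choice to clear a tier after promotion, the cosine-similarity retrieval, and `toy_summarize` (a stand-in for whatever LLM-based distillation the paper uses) are all hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_summarize(items):
    """Hypothetical stand-in for an LLM summarizer: reports how many entries it saw."""
    return f"summary of {len(items)} entries"


@dataclass
class HierarchicalCache:
    """Toy three-tier cache: L1 raw traces, L2 phase summaries, L3 cross-task notes."""
    l1_experience: list = field(default_factory=list)  # transient execution traces
    l2_knowledge: list = field(default_factory=list)   # per-phase refined knowledge
    l3_wisdom: list = field(default_factory=list)      # (task embedding, note) pairs

    def prefetch(self, task_embedding, top_k=1):
        """Context prefetching: fetch the prior-wisdom notes most similar to a new task."""
        ranked = sorted(self.l3_wisdom,
                        key=lambda pair: cosine(pair[0], task_embedding),
                        reverse=True)
        return [note for _, note in ranked[:top_k]]

    def record(self, trace):
        """Append one raw execution trace to L1."""
        self.l1_experience.append(trace)

    def build_context(self, recent=2):
        """Context hit: serve stable L2 knowledge plus only the latest L1 traces."""
        return self.l2_knowledge + self.l1_experience[-recent:]

    def promote_phase(self, summarize=toy_summarize):
        """Phase-level promotion: compress all L1 traces into one L2 entry."""
        if self.l1_experience:
            self.l2_knowledge.append(summarize(self.l1_experience))
            self.l1_experience.clear()  # assumption: raw traces are dropped afterwards

    def promote_task(self, task_embedding, summarize=toy_summarize):
        """Task-level promotion: distill L2 knowledge into an L3 note keyed by the task."""
        if self.l2_knowledge:
            self.l3_wisdom.append((list(task_embedding), summarize(self.l2_knowledge)))
            self.l2_knowledge.clear()  # assumption: L2 is reset between tasks


# Usage: warm-start L3, run one phase, then promote.
cache = HierarchicalCache()
cache.l3_wisdom.append(([1.0, 0.0], "tabular: try gradient boosting first"))
cache.l3_wisdom.append(([0.0, 1.0], "vision: start from a pretrained backbone"))
hints = cache.prefetch([0.9, 0.1])
for step in range(5):
    cache.record(f"run {step}: trained model, logged score")
cache.promote_phase()
context = cache.build_context()
```

In this sketch the working context after promotion holds a single summary instead of five raw traces, which is the mechanism by which a design of this kind could keep context length bounded as exploration continues.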

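Experimental Results

Research Questions

RQ1: Can Hierarchical Cognitive Caching sustain strategic coherence across tens of hours of autonomous exploration?
RQ2: Do the L1/L2/L3 components contribute synergistically to performance and stability?
RQ3: How does cognitive accumulation affect medal rates on low-, medium-, and high-complexity tasks?
RQ4: How does context migration (prefetching, hits, promotion) affect context length and learning efficiency?
RQ5: Compared with existing autonomous ML agents, how does ML-Master 2.0 perform on MLE-Bench in terms of robustness and transferability?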

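Key Findings

ML-Master 2.0 achieves an average medal rate of 56.44% on MLE-Bench, the highest among the evaluated methods.
The gains are consistent across low-, medium-, and high-complexity tasks (medal rates of 75.8%, 50.9%, and 42.2%, respectively).
Context length stays under control: with HCC it peaks at roughly 70k tokens, whereas without HCC it grows unchecked to more than 200k tokens.
Ablations show that removing any cache tier degrades performance: L1 (experience) is foundational, L2 (knowledge) is essential for synthesis, and L3 (wisdom) is key to cross-task transfer.
The method is robust and outperforms humans on a substantial share of tasks (on 63.1% of tasks it exceeds the 50% human level).
ML-Master 2.0 shows an improved medal quality distribution (higher valid-submission and medal rates) and remains strong as task difficulty increases.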
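
As a quick consistency check on the numbers above, the per-tier medal rates can be combined into an overall rate by weighting each tier by its number of tasks. The sketch below assumes the MLE-Bench split of 22 low-, 38 medium-, and 15 high-complexity competitions; those counts are not stated in this review and should be treated as an assumption. Because the per-tier rates are rounded to one decimal place, the result only approximates the reported 56.44%.

```python
# Per-tier medal rates as reported in this review (percent).
tier_medal_rate = {"low": 75.8, "medium": 50.9, "high": 42.2}

# ASSUMPTION: number of MLE-Bench competitions per complexity tier
# (not given in this review).
tier_task_count = {"low": 22, "medium": 38, "high": 15}


def overall_medal_rate(rates, counts):
    """Task-weighted average of per-tier medal rates, in percent."""
    total_tasks = sum(counts.values())
    return sum(rates[tier] * counts[tier] for tier in rates) / total_tasks


overall = overall_medal_rate(tier_medal_rate, tier_task_count)
print(f"Overall medal rate: {overall:.2f}%")
```

Under that assumed split, the weighted average lands within a few hundredths of a percentage point of the headline figure, which suggests the overall rate is a task-weighted mean rather than a simple mean of the three tiers.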

This review was generated by AI and reviewed by human editors.