QUICK REVIEW

[Paper Review] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu, Yuzhu Cai | arXiv (Cornell University) | 2026-01-15

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

ML-Master 2.0 introduces Hierarchical Cognitive Caching to enable ultra-long-horizon autonomous ML engineering, achieving a 56.44% medal rate on MLE-Bench and strong performance across task difficulty levels.

ABSTRACT

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

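Motivation and Objectives

Reframe ultra-long-horizon autonomy as cognitive accumulation: converting transient experience into reusable knowledge and wisdom.
Propose Hierarchical Cognitive Caching (HCC), which combines a multi-tier cache with context migration to manage long-horizon context.
Show that decoupling short-term execution from long-term strategy improves the stability and performance of MLE work.
Validate HCC empirically on OpenAI's MLE-Bench, demonstrating a state-of-the-art medal rate and robustness across task complexity levels.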

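Proposed Method

Introduce a three-level Hierarchical Cognitive Cache (L1: Evolving Experience, L2: Refined Knowledge, L3: Prior Wisdom) that separates transient context from stable cognition.
Implement context migration through three operations: initialization via context prefetching, retrieval via context hits, and consolidation via context promotion.
Treat MLE as ultra-long-horizon planning, with phase-based hierarchical planning and parallel exploration directions.
Use phase-level promotion to compress trajectories into refined knowledge, and task-level promotion to derive transferable wisdom.
Evaluate on OpenAI's MLE-Bench under a fixed 24-hour budget, with medal rate (Bronze/Silver/Gold) as the primary metric.
Use the Prior Wisdom cache (L3) together with task-agnostic descriptor embeddings for cross-task transfer.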
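
The three cache levels and the three migration operations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `HCCSketch`, its method names, the word-overlap `similarity` function (a toy stand-in for descriptor embeddings), and the `summarize` function (a toy stand-in for LLM-based distillation) are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


def similarity(a: str, b: str) -> float:
    """ASSUMPTION: word-overlap (Jaccard) score standing in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def summarize(items: list[str]) -> str:
    """ASSUMPTION: toy stand-in for LLM distillation; keeps only the last entry."""
    return items[-1] if items else ""


@dataclass
class HCCSketch:
    l1: list[str] = field(default_factory=list)       # L1 Evolving Experience: raw traces
    l2: list[str] = field(default_factory=list)       # L2 Refined Knowledge: phase summaries
    l3: dict[str, str] = field(default_factory=dict)  # L3 Prior Wisdom: descriptor -> lesson

    def prefetch(self, descriptor: str, top_k: int = 1) -> list[str]:
        """Context prefetching: fetch lessons from the most similar past tasks."""
        ranked = sorted(self.l3, key=lambda d: similarity(d, descriptor), reverse=True)
        return [self.l3[d] for d in ranked[:top_k]]

    def record(self, trace: str) -> None:
        """Append a raw execution trace to L1."""
        self.l1.append(trace)

    def context(self, prior: list[str]) -> list[str]:
        """Context hit: serve prefetched wisdom, then L2 summaries, then L1 traces."""
        return prior + self.l2 + self.l1

    def promote_phase(self) -> None:
        """Phase-level promotion: compress L1 into a single L2 entry and clear L1."""
        if self.l1:
            self.l2.append(summarize(self.l1))
            self.l1.clear()

    def promote_task(self, descriptor: str) -> None:
        """Task-level promotion: distill L2 into an L3 lesson keyed by the descriptor."""
        self.promote_phase()
        if self.l2:
            self.l3[descriptor] = summarize(self.l2)
            self.l2.clear()


# Usage: one phase of a new task, seeded from a similar earlier task.
cache = HCCSketch()
cache.l3["tabular binary classification"] = "start with gradient boosting"
cache.l3["image segmentation"] = "use a pretrained encoder"
prior = cache.prefetch("tabular multiclass classification")
cache.record("baseline logistic regression: acc 0.81")
cache.record("gradient boosting: acc 0.88")
cache.promote_phase()
working_context = cache.context(prior)
```

After `promote_phase()`, the two raw traces are replaced by one summary, so the working context holds two short entries instead of growing with every step. This is the mechanism by which a tiered cache can keep context length bounded while earlier results remain available in compressed form.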

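Experimental Results

Research Questions

RQ1: Can Hierarchical Cognitive Caching maintain strategic coherence over autonomous exploration lasting tens of hours?
RQ2: Do the L1/L2/L3 components contribute synergistically to performance and stability?
RQ3: How does cognitive accumulation affect the medal rate on low-, medium-, and high-complexity tasks?
RQ4: How does context migration (prefetching, hit, promotion) affect context length and learning efficiency?
RQ5: How does ML-Master 2.0 compare with existing autonomous ML agents on MLE-Bench in terms of robustness and transferability?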

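Key Results

ML-Master 2.0 achieves an average medal rate of 56.44% on MLE-Bench, the highest among the evaluated methods.
Gains are consistent across low-, medium-, and high-complexity tasks (medal rates of 75.8%, 50.9%, and 42.2%, respectively).
Context length is effectively controlled: it peaks at roughly 70k tokens, compared with unbounded growth beyond 200k tokens without HCC.
Ablations show that removing any cache level degrades performance: L1 (experience) is foundational, L2 (knowledge) is essential for synthesis, and L3 (wisdom) matters for cross-task transfer.
The approach is robust and surpasses human performance on a substantial share of tasks: on 63.1% of tasks it outperforms 50% of human participants.
ML-Master 2.0 shows an improved medal-quality distribution (higher valid/medal ratios) and maintains a strong baseline as task difficulty increases.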
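
As a quick consistency check, the per-complexity medal rates can be combined into a task-weighted average and compared with the reported 56.44%. The sketch below assumes the MLE-Bench split of 22 low, 38 medium, and 15 high-complexity tasks; this review does not state those counts, so treat them as an assumption.

```python
# Per-complexity medal rates reported in the review (percent).
medal_rate = {"low": 75.8, "medium": 50.9, "high": 42.2}

# ASSUMPTION: number of tasks per complexity split; not stated in this review.
task_count = {"low": 22, "medium": 38, "high": 15}

total_tasks = sum(task_count.values())
overall = sum(medal_rate[k] * task_count[k] for k in medal_rate) / total_tasks
print(f"task-weighted medal rate: {overall:.2f}%")
```

Under that assumption the weighted average comes to about 56.46%, which agrees with the reported 56.44% up to rounding of the per-split figures.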

This review was generated by AI and reviewed by a human editor.