QUICK REVIEW

[Paper Review] MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan|arXiv (Cornell University)|Oct 9, 2024

Machine Learning and Data Classification | Citations: 7

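One-Sentence Summary

MLE-bench is an offline, Kaggle-based benchmark that evaluates AI agents on autonomous ML engineering tasks across 75 competitions; measured against human baselines, the best setup (OpenAI's o1-preview with the AIDE scaffold) achieves a limited but meaningful medal rate.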

ABSTRACT

This paper develops a theory of search stability for long-running agents operating under finite active context, delayed verification, sparse expensive feedback, path-dependent lock-in, and lossy state compression. The focus is not only on model quality, but on the mesoscopic law layer that governs how an agent should preserve, retire, substitute, compress, branch, and reset competing hypotheses or route summaries over time. The framework models search state as an active hypothesis portfolio partitioned into coarse families under a context budget. Each item carries promise, verification lag, retention cost, staleness, overlap burden, and inertia. A central contribution is a set-valued adequacy semantics: within each discrimination window, the system is associated with a nonempty random set of operationally adequate families induced by the realized initial information state and downstream randomness. Success is defined as preserving recoverability of at least one adequate family at the first strongly discriminating verification stage, avoiding dependence on a selector-defined pseudo-truth. The paper derives threshold and impossibility results for context contamination, shadow retirement, delayed-verification coverage, reserve feasibility, and budget-limited adequacy. It also develops a theory of within-family semantic substitution, compressed-control alias hazard, reset admissibility, stale-legacy drift, diagnostic regret decomposition, and rolling-window lifting for long-running agents with repeated verification stages and changing task modes. The intended contribution is an audit-and-design law layer for bounded-memory AI systems. The theory is deliberately narrow and conditional, but it aims to make long-horizon agent failures more diagnosable: separating failures caused by bounded-memory hypothesis ecology from failures caused by raw model weakness, and from mixtures of both.

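Motivation and Objectives

Motivate and measure the autonomous ML engineering capabilities of AI agents on tasks that resemble real-world work.
Curate a diverse, challenging set of Kaggle competitions that represent core ML engineering skills.
Establish human baselines from Kaggle's private leaderboards and evaluate frontier models against them.
Investigate how scaffolding, model choice, and compute resources affect agent performance.
Open-source the benchmark to enable continued research on autonomous ML engineering.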

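Proposed Method

Build an offline Kaggle competition environment with 75 curated tasks, each with its dataset, training scripts, and grading logic.
Measure agent performance by medals (bronze/silver/gold) awarded against the private leaderboard, and report a single headline metric: the medal rate.
Evaluate multiple agent scaffolds (AIDE, MLAB, OpenHands) and a range of models (o1-preview, GPT-4o, Claude, Llama) to identify the best combination.
Run ablations over pass@k (multiple attempts), compute resources, and extended time budgets to probe the performance ceiling.
Analyze the models' familiarity with competition solutions and the risk of contamination, including an experiment with obfuscated competition descriptions, and run plagiarism checks.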
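The medal-based headline metric in the list above can be illustrated with a minimal sketch. It assumes each competition exposes score cutoffs for gold, silver, and bronze derived from its private leaderboard; the function names, the cutoff dictionaries, and all numbers below are hypothetical and are not taken from the benchmark's code.

```python
from typing import Optional


def medal_for(score: float, cutoffs: dict, higher_is_better: bool = True) -> Optional[str]:
    """Return the best medal whose cutoff the score meets, or None.

    `cutoffs` maps "gold"/"silver"/"bronze" to score thresholds
    (hypothetical values standing in for private-leaderboard thresholds).
    """
    for medal in ("gold", "silver", "bronze"):  # check the best medal first
        cutoff = cutoffs[medal]
        meets = score >= cutoff if higher_is_better else score <= cutoff
        if meets:
            return medal
    return None


def medal_rate(medals: list) -> float:
    """Percentage of competitions in which any medal was earned."""
    return 100.0 * sum(m is not None for m in medals) / len(medals)


# Hypothetical agent results on four competitions (not real benchmark data).
competitions = [
    {"score": 0.95, "cutoffs": {"gold": 0.94, "silver": 0.92, "bronze": 0.90}, "higher_is_better": True},
    {"score": 0.40, "cutoffs": {"gold": 0.10, "silver": 0.20, "bronze": 0.30}, "higher_is_better": False},
    {"score": 0.91, "cutoffs": {"gold": 0.97, "silver": 0.95, "bronze": 0.90}, "higher_is_better": True},
    {"score": 0.50, "cutoffs": {"gold": 0.80, "silver": 0.70, "bronze": 0.60}, "higher_is_better": True},
]

medals = [
    medal_for(c["score"], c["cutoffs"], c["higher_is_better"]) for c in competitions
]
print(medals)              # ['gold', None, 'bronze', None]
print(medal_rate(medals))  # 50.0
```

The `higher_is_better` flag matters because some competition metrics are losses (lower is better) while others are accuracies (higher is better), so a single comparison direction would misassign medals.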

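Experimental Results

Research Questions

RQ1: Can autonomous AI agents earn Kaggle-style medals on ML engineering tasks?
RQ2: How do the scaffold and the underlying model affect end-to-end ML engineering performance on real-world tasks?
RQ3: How do more attempts, more compute, or a longer time budget affect medal acquisition?
RQ4: Does memorization or contamination of data and solutions inflate agent performance on this benchmark?
RQ5: How does agent performance on MLE-bench compare with human-level performance in modern Kaggle competitions?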

Key Findings

Model	Made submission (%)	Valid submission (%)	Above median (%)	Bronze (%)	Silver (%)	Gold (%)	Any medal (%)
o1-preview (AIDE)	98.4 ± 0.4	82.8 ± 1.1	29.4 ± 1.3	3.4 ± 0.5	4.1 ± 0.6	9.4 ± 0.8	16.9 ± 1.1
GPT-4o (AIDE)	70.7 ± 0.9	54.9 ± 1.0	14.4 ± 0.7	1.6 ± 0.2	2.2 ± 0.3	5.0 ± 0.4	8.7 ± 0.5
llama-3.1-405b-instruct	46.3 ± 2.9	27.3 ± 2.6	6.7 ± 1.4	0.0 ± 0.0	1.3 ± 0.7	1.7 ± 0.7	3.0 ± 1.0
claude-3-5-sonnet-20240620	68.9 ± 3.1	51.1 ± 3.3	12.9 ± 2.2	0.9 ± 0.6	2.2 ± 1.0	4.4 ± 1.4	7.6 ± 1.8
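As a quick sanity check on the table above, the bronze, silver, and gold columns should add up to the any-medal column, up to rounding of the reported one-decimal values. The sketch below copies only the mean values from the table, ignoring the ± terms; the dictionary layout and function name are illustrative.

```python
# (bronze, silver, gold, any_medal) mean percentages from the table, keyed by model.
rows = {
    "o1-preview": (3.4, 4.1, 9.4, 16.9),
    "GPT-4o": (1.6, 2.2, 5.0, 8.7),
    "llama-3.1-405b-instruct": (0.0, 1.3, 1.7, 3.0),
    "claude-3-5-sonnet-20240620": (0.9, 2.2, 4.4, 7.6),
}


def medal_gap(bronze: float, silver: float, gold: float, any_medal: float) -> float:
    """Difference between the summed medal columns and the any-medal column."""
    return round(bronze + silver + gold - any_medal, 1)


gaps = {model: medal_gap(*values) for model, values in rows.items()}

for model, gap in gaps.items():
    # Each column is rounded independently, so a gap of about 0.1 is expected noise.
    status = "consistent" if abs(gap) <= 0.1 + 1e-9 else "check"
    print(f"{model}: gap = {gap:+.1f} ({status})")
```

All four rows agree to within 0.1 percentage points, which is what independent rounding of each column would produce.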

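The best-performing setup (o1-preview + AIDE) earns a medal in 16.9% of competitions on average.
GPT-4o + AIDE earns medals in 8.7% of competitions, rising to 11.8% when given 100 hours per competition.
Multiple attempts (pass@k) improve medal acquisition: for both GPT-4o/AIDE and o1-preview/AIDE, the pass@6 medal rate is roughly twice the pass@1 rate.
GPT-4o/AIDE performance is robust across hardware settings (CPU only, one A10 GPU, two A10 GPUs).
Contamination and plagiarism checks show no systematic score inflation, and no plagiarism was detected in medal-winning submissions.
A larger time budget yields more medals, although tool effectiveness and the choice of which solution gets graded may affect how the observed medal count evolves over time.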
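The pass@k finding above can be made concrete with a short sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of attempts on a competition and c is the number of those attempts that earned a medal. Whether the paper computes pass@k in exactly this way is an assumption here, and the attempt counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n attempts of which c succeeded, is a success."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(medal_counts: list, n: int, k: int) -> float:
    """Average pass@k over competitions, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for c in medal_counts) / len(medal_counts)


# Hypothetical: four competitions, six attempts each; each entry is the
# number of attempts that earned a medal (not real benchmark data).
medal_counts = [6, 1, 0, 0]

p1 = benchmark_pass_at_k(medal_counts, n=6, k=1)
p6 = benchmark_pass_at_k(medal_counts, n=6, k=6)
print(f"pass@1 = {p1:.1f}%, pass@6 = {p6:.1f}%")
```

In this toy example the gain from pass@1 to pass@6 comes entirely from the competition where only one of six attempts succeeds. Competitions that are always or never solved contribute the same amount at every k.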

This review was generated by AI and checked by a human editor.