QUICK REVIEW

[Paper Review] Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments

Xi Li|arXiv (Cornell University)|Mar 16, 2026

Robotic Path Planning Algorithms | Citations: 0

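One-Sentence Summary

Summary: The paper analyzes 10,469 autonomous ML experiments guided by LLM agents, shows that architectural choices rather than hyperparameters drive performance, and characterizes convergence, multi-agent dynamics, and task transfer.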

ABSTRACT

When LLM agents autonomously design ML experiments, do they perform genuine architecture search -- or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that \textbf{architectural choices explain 94\% of performance variance} ($F = 1324$, $η^2 = 0.94$), while hyperparameter variation within a fixed architecture explains only 6\%. Cross-task validation on a second collision dataset confirms this finding (75\% architecture-explained variance) with a \emph{different} winning backbone, confirming genuine architecture discovery. The agents' key contribution is discovering that V-JEPA\,2 video features with Zipformer temporal encoders achieve 0.9245 AP -- a configuration no human proposed -- and concentrating search on productive architectural regions: at $N = 50$, LLM-guided search reaches AP $= 0.985$ versus $0.965$ for from-scratch random search. Post-bugfix convergence follows a power law ($c = 0.11$, $R^2 = 0.93$); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen--Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.

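Motivation and Objectives

Determine whether LLM agents perform genuine architecture search within a fixed configuration space.
Quantify how much of the performance variance in a large-scale autonomous ML campaign is attributable to architectural choices versus hyperparameters.
Develop an empirical framework that uses information-theoretic metrics to analyze LLM-guided combinatorial experiment search.
Characterize convergence dynamics, multi-agent behavior, and specialization patterns in autonomous ML research loops.
Release a large-scale dataset of autonomous experiments for benchmarking LLM-guided research dynamics.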

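Proposed Method

Formulates autonomous ML research as search over a structured combinatorial configuration space $\mathcal{C} = \mathcal{C}_{\mathrm{arch}} \times \mathcal{C}_{\mathrm{loss}} \times \mathcal{C}_{\mathrm{train}} \times \mathcal{C}_{\mathrm{data}}$ containing 108,000 discrete cells.
Defines each LLM agent as a contextual search policy that proposes $c_{t}$ conditioned on the history $H_{t-1}$, and analyzes search dynamics with information-theoretic metrics (entropy, Jensen-Shannon divergence, innovation rate).
Proposes the convergence model $AP^{*}(N) = a - b N^{-c}$ and fits the exponent $c$ for the random, TPE, and LLM policies.
Uses ANOVA to compare hyperparameter and architecture contributions across backbone/encoder groups, computing $\eta^2$ to quantify explained variance.
Uses a post-bugfix subset to isolate search dynamics from infrastructure confounds, and compares against baselines such as random search and TPE.
Provides an open dataset of 10,469 experiments for benchmarking autonomous ML research dynamics.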
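
The ANOVA step in the list above can be illustrated with a minimal sketch. It assumes a plain one-way ANOVA with architecture as the only factor; the review does not spell out the paper's exact decomposition, so the function name `one_way_anova`, the group labels, and the toy AP scores below are all hypothetical and not taken from the paper.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (eta_squared, F) for a dict mapping group label -> list of AP scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = fmean(all_scores)
    # Between-group sum of squares: spread of architecture means around the grand mean.
    ss_between = sum(len(s) * (fmean(s) - grand_mean) ** 2 for s in groups.values())
    # Within-group sum of squares: spread left over inside each architecture,
    # i.e. the part attributed to hyperparameter variation in this simplified view.
    ss_within = sum((x - fmean(s)) ** 2 for s in groups.values() for x in s)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    eta_squared = ss_between / (ss_between + ss_within)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return eta_squared, f_stat


# Hypothetical AP scores (NOT from the paper): three architectures,
# three hyperparameter settings each.
toy = {
    "arch_A": [0.90, 0.91, 0.92],
    "arch_B": [0.80, 0.81, 0.82],
    "arch_C": [0.70, 0.71, 0.72],
}
eta2, f_stat = one_way_anova(toy)
print(f"eta^2 = {eta2:.3f}, F = {f_stat:.1f}")
```

In this toy data the gaps between architectures are much larger than the spread inside each one, so $\eta^2$ comes out close to 1. That is the same qualitative pattern the review reports for the real campaign.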

Figure 1: System overview. Two LLM agents observe the shared leaderboard and propose configurations $c_{t}\in\mathcal{C}$ . The orchestrator deduplicates proposals, schedules execution on a GPU cluster, and updates the history $H_{t}$ . Self-healing handles runtime failures via LLM-assisted diagnosi

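Experimental Results

Research Questions

RQ1: Does LLM-guided experimentation produce genuine architecture search, or merely hyperparameter tuning?
RQ2: What fraction of the AP variance is explained by architectural choices versus hyperparameters?
RQ3: How do the convergence dynamics of LLM-guided search compare with random and Bayesian baselines?
RQ4: What characterizes multi-agent search dynamics and specialization when two LLMs collaborate?
RQ5: Does architecture dominance transfer across tasks and datasets?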

Key Findings

Rank	Encoder	gamma	AP	Source
1	Zipformer	3.0	0.9245	LLM
2	Zipformer	2.5	0.9206	HP-sweep
3	Zipformer	3.0	0.9203	LLM
4	Retention	2.5	0.9132	LLM
5	Hybrid R-M	2.5	0.9054	LLM

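Architectural choices explain 94% of the post-bugfix AP variance (η^2 = 0.94; F = 1324.3).
Hyperparameter variation within a fixed architecture explains only 6% of the AP variance.
In cross-task validation on a second collision dataset, architecture explains 75% of the AP variance, with a different winning backbone.
The top configuration (AP = 0.9245) uses a V-JEPA 2 backbone with a Zipformer encoder and focal loss (γ = 3.0).
LLM-guided search reaches AP = 0.985 at N = 50, ahead of AP = 0.965 for from-scratch random search; post-bugfix convergence follows a power law with c = 0.11 (R^2 = 0.93).
Multi-agent dynamics show explore-exploit entropy cycles and a spike-and-decay specialization pattern (Jensen-Shannon divergence) as the agents converge on the winning architectures.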
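
The entropy and Jensen-Shannon measurements mentioned in the findings can be sketched as follows. This is an illustration under assumptions, not the paper's code: it treats each agent's recent proposals as an empirical distribution over encoder names and uses base-2 logarithms. The encoder names come from the results table, but the proposal sequences and the helper names (`distribution`, `entropy`, `js_divergence`) are hypothetical.

```python
import math
from collections import Counter


def distribution(choices, support):
    """Empirical probability of each option in `support` among `choices`."""
    counts = Counter(choices)
    total = len(choices)
    return [counts[k] / total for k in support]


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(M) - (H(P) + H(Q)) / 2, with M the mixture."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Hypothetical proposal windows for two agents (NOT from the paper).
support = ["Zipformer", "Retention", "Hybrid R-M"]
agent_a = ["Zipformer", "Zipformer", "Retention", "Zipformer"]
agent_b = ["Retention", "Hybrid R-M", "Retention", "Retention"]

p_a = distribution(agent_a, support)
p_b = distribution(agent_b, support)
print("entropy A:", entropy(p_a))
print("entropy B:", entropy(p_b))
print("JS(A, B):", js_divergence(p_a, p_b))
```

With base-2 logarithms the divergence lies between 0 (both agents propose the same mix) and 1 (their proposals never overlap). Computed over a sliding window, a high value would suggest the agents are specializing and a falling value would suggest they are converging on the same architectures.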

Figure 2: Convergence of cumulative best AP. Left: full campaign (10,469 experiments) showing discrete jumps at bug-fix events (vertical dashed lines). Right: post-bugfix subset (3,003 experiments) with a cleaner power-law fit ( $R^{2}=0.93$ ). Both $\pi_{\mathrm{rand}}$ and $\pi_{\mathrm{TPE}}$ ope
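
A power-law fit of the form $AP^{*}(N) = a - b N^{-c}$ can be sketched as follows. The review does not say how the paper fits the curve, so this example swaps in a simple assumed procedure: scan candidate ceilings `a` on a grid, fit `b` and `c` by least squares in log-log space for each, and keep the candidate with the smallest squared error. The data are synthetic and noise-free. They use the reported exponent c = 0.11, while `a = 0.95` and `b = 0.30` are hypothetical values chosen only for illustration.

```python
import math


def fit_power_law(ns, aps, a_grid):
    """Fit AP*(N) = a - b * N**(-c) by scanning `a` and regressing log(a - AP) on log(N)."""
    xs = [math.log(n) for n in ns]
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    best = None
    for a in a_grid:
        if a <= max(aps):
            continue  # log(a - AP) is undefined unless a exceeds every observation
        ys = [math.log(a - ap) for ap in aps]
        y_bar = sum(ys) / len(ys)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        b = math.exp(y_bar - slope * x_bar)
        c = -slope  # log(a - AP) = log(b) - c * log(N)
        sse = sum((ap - (a - b * n ** (-c))) ** 2 for n, ap in zip(ns, aps))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("no candidate a exceeds the best observed AP")
    return best[1], best[2], best[3]


# Synthetic, noise-free curve (NOT the paper's data): a and b are hypothetical.
ns = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
aps = [0.95 - 0.30 * n ** (-0.11) for n in ns]
a_grid = [0.90 + 0.005 * i for i in range(41)]  # candidate ceilings 0.90 .. 1.10

a_hat, b_hat, c_hat = fit_power_law(ns, aps, a_grid)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, c = {c_hat:.3f}")
```

On real, noisy cumulative-best curves a nonlinear least-squares routine would likely be a better choice than this grid scan, and the log transform only works while every observation stays below the candidate ceiling.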

This review was generated by AI and checked by a human editor.