QUICK REVIEW

[Paper Review] Agent Laboratory: Using LLM Agents as Research Assistants

Samuel Schmidgall, Yusheng Su | arXiv (Cornell University) | Jan 8, 2025

Multi-Agent Systems and Negotiation | Cited by 24

One-Sentence Summary

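Agent Laboratory is an autonomous LLM-agent framework that takes a human-provided research idea and works through three stages (literature review, experimentation, and report writing) to produce complete research outputs: a code repository and a paper. Users can give feedback at each stage to improve quality, and the framework substantially reduces research cost.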

ABSTRACT

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

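Research Motivation and Goals

Accelerate scientific discovery through a machine learning research workflow that is autonomous yet guided by humans.
Reduce research cost while maintaining or improving output quality.
Provide an open-source, compute-flexible framework that covers literature review, experimentation, and report writing.
Evaluate multiple LLM backends to understand the trade-offs among experiment quality, report quality, and usefulness.
Compare autonomous mode with co-pilot mode and quantify the effect of each on research output.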

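Proposed Method

Three-stage pipeline: literature review, experimentation, and report writing.
PhD and Postdoc agents collaborate to plan experiments, organize the literature, and define the data preparation and modeling steps.
MLE-Solver iteratively generates, tests, and improves ML code, using a scoring/reward model and self-reflection to converge on a high-quality implementation.
Paper-Solver generates and refines a LaTeX-based academic report, drawing on arXiv access and an automated reviewer that simulates NeurIPS-style feedback.
A NeurIPS-style evaluation compares automated reviews with human reviews to assess the consistency and quality of the outputs.
Co-pilot mode adds a human checkpoint after each subtask, so outputs can be revised before the pipeline continues.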
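
The method list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `STAGES`, `refine`, `run_pipeline`, `run_stage`, and `checkpoint` are hypothetical, the stage logic is passed in as plain callables rather than LLM calls, and the checkpoint is applied once per stage for simplicity (the summary describes checkpoints after each subtask).

```python
from typing import Callable, Dict, Optional

# Hypothetical stage names taken from the three-stage description.
STAGES = ["literature_review", "experimentation", "report_writing"]


def refine(candidate: str,
           propose: Callable[[str], str],
           score: Callable[[str], float],
           steps: int = 3) -> str:
    """Score-guided refinement loop in the spirit of MLE-Solver.

    Assumption: `propose` returns a modified candidate and `score` returns
    a number where higher is better. Only improvements are kept.
    """
    best, best_score = candidate, score(candidate)
    for _ in range(steps):
        new = propose(best)
        new_score = score(new)
        if new_score > best_score:
            best, best_score = new, new_score
    return best


def run_pipeline(idea: str,
                 run_stage: Callable[[str, str], str],
                 checkpoint: Optional[Callable[[str, str], str]] = None
                 ) -> Dict[str, str]:
    """Run the stages in order, feeding each output into the next stage.

    With `checkpoint=None` this behaves like autonomous mode; passing a
    callback imitates co-pilot mode, where a human may revise each output
    before the pipeline continues.
    """
    outputs: Dict[str, str] = {}
    context = idea
    for stage in STAGES:
        result = run_stage(stage, context)
        if checkpoint is not None:
            result = checkpoint(stage, result)
        outputs[stage] = result
        context = result
    return outputs


if __name__ == "__main__":
    # Toy stand-ins: a stage just wraps its input; the "human" appends a note.
    autonomous = run_pipeline("my idea", lambda s, c: f"{s}({c})")
    copilot = run_pipeline("my idea", lambda s, c: f"{s}({c})",
                           checkpoint=lambda s, r: r + " [revised]")
    print(autonomous["report_writing"])
    print(copilot["report_writing"])
```

Keeping the checkpoint as an optional callback means the same pipeline code serves both modes, which mirrors how the summary contrasts autonomous and co-pilot runs.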

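Experimental Results

Research Questions

RQ1: How does Agent Laboratory perform in end-to-end autonomous versus co-pilot configurations?
RQ2: Which language model backends strike the best balance among experiment quality, report quality, and usefulness?
RQ3: How does human feedback at different stages affect overall research quality?
RQ4: What are the cost and runtime characteristics of Agent Laboratory under different backends?
RQ5: Can Agent Laboratory produce competitive machine learning code and research outputs on established benchmarks?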

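Key Findings

Autonomous output varies by backend: o1-preview gives the highest perceived usefulness and report quality, o1-mini gives the highest experiment quality, and gpt-4o generally scores lowest.
Human assessments consistently contradicted the automated evaluations: the automated reviewer tended to overestimate quality relative to human reviewers.
Co-pilot mode achieved higher overall scores than autonomous mode, which indicates the benefit of human guidance at each stage.
Agent Laboratory substantially reduces research cost, by up to 84% compared with previous autonomous methods; with the gpt-4o backend, for example, a representative cost is $2.33 per paper.
MLE-Solver achieved near state-of-the-art performance on a subset of MLE-Bench challenges, with higher consistency and more medals than competing methods.
Across all configurations, papers from autonomous runs generally scored below the typical NeurIPS acceptance threshold, which underscores the need for further improvement before targeting top-tier venues.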
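
To make the cost finding concrete, the arithmetic below works out what baseline cost the two reported numbers would imply. It rests on an assumption the summary does not state explicitly: that the 84% reduction applies to the $2.33 gpt-4o figure. The function name `implied_baseline` is hypothetical, and the result is a back-of-the-envelope estimate rather than a number reported in the summary.

```python
def implied_baseline(cost: float, reduction: float) -> float:
    """Return the baseline cost implied by a reduced cost.

    If new_cost = baseline * (1 - reduction), then
    baseline = new_cost / (1 - reduction).
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return cost / (1 - reduction)


# Assumption: the 84% figure is paired with the $2.33 gpt-4o cost.
baseline = implied_baseline(cost=2.33, reduction=0.84)
savings = baseline - 2.33

if __name__ == "__main__":
    print(f"implied baseline per paper: ${baseline:.2f}")
    print(f"implied savings per paper:  ${savings:.2f}")
```

Under that assumption, the implied baseline is roughly $14.56 per paper, so the saving is on the order of $12 per paper.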


This review was generated by AI and checked by a human editor.