QUICK REVIEW

[Paper Digest] OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li | arXiv (Cornell University) | Jul 23, 2024

Multi-Agent Systems and Negotiation | Cited by 7

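One-Sentence Summary

OpenHands is a community-driven platform that lets AI agents interact with the world by writing code, executing it in a Docker sandbox, and browsing the web; it is evaluated on 15 benchmarks and supports multi-agent delegation.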

ABSTRACT

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.

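Motivation and Objectives

Motivate the need for AI agents that can both develop software and interact with the real world.
Propose a platform architecture for building, evaluating, and safely running generalist and specialist agents.
Enable agent collaboration through multi-agent delegation and a shared evaluation framework.
Provide an open, community-driven hub of agents and tools to accelerate research and applications.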

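Proposed Method

Define an agent abstraction that interacts with the environment through an event stream of actions and observations.
Provide a runtime built on a Docker sandbox, with a bash shell, an IPython server, and a Playwright-based browser for executing actions.
Introduce programming-language (PL) based action primitives (IPythonRunCellAction, CmdRunAction, BrowseInteractiveAction) and an AgentSkills library of reusable tools.
Support multi-agent delegation through AgentDelegateAction, composing specialized agents into a system that solves a task.
Provide an evaluation framework with 15 benchmarks covering software engineering, web browsing, and other tasks.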
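
To make the event-stream and delegation ideas concrete, here is a minimal sketch in Python. It is not the OpenHands API: every class and function name below (Action, Observation, State, EchoAgent, ManagerAgent, run) is hypothetical and chosen only for illustration. It assumes an agent exposes a single step method that reads the history of events and returns the next action, and that delegation simply runs a second agent and records its result as an observation.

```python
from dataclasses import dataclass, field
from typing import List, Union

# All names here are hypothetical illustrations, not the OpenHands API.


@dataclass
class Action:
    """Something an agent asks the environment to do."""
    kind: str          # "delegate" or "finish" in this sketch
    payload: str = ""


@dataclass
class Observation:
    """What the environment reports back after an action."""
    content: str


Event = Union[Action, Observation]


@dataclass
class State:
    """The event stream: an ordered history of actions and observations."""
    task: str
    history: List[Event] = field(default_factory=list)


class EchoAgent:
    """Toy specialist: 'solves' its task in one step by upper-casing it."""

    def step(self, state: State) -> Action:
        return Action("finish", state.task.upper())


class ManagerAgent:
    """Toy generalist: delegates the task once, then reports the result."""

    def step(self, state: State) -> Action:
        last = state.history[-1] if state.history else None
        if isinstance(last, Observation):
            return Action("finish", f"delegate said: {last.content}")
        return Action("delegate", state.task)


def run(agent, task: str, max_steps: int = 5) -> str:
    """Drive one agent until it finishes or the step budget runs out."""
    state = State(task)
    for _ in range(max_steps):
        action = agent.step(state)
        state.history.append(action)
        if action.kind == "finish":
            return action.payload
        if action.kind == "delegate":
            # Delegation: run a sub-agent and feed its answer back as an observation.
            result = run(EchoAgent(), action.payload)
            state.history.append(Observation(result))
    return "step limit reached"
```

Calling run(ManagerAgent(), "fix bug") makes the manager emit a delegate action, records the specialist's answer as an observation, and then finishes with that answer. The design point is that agents never touch the environment directly; they only read the history and return the next action.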

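Experimental Results

Research Questions

RQ1: How does OpenHands define and implement versatile agents that work like human software engineers?
RQ2: How are actions turned into observations in the Docker sandbox runtime?
RQ3: How can agents be extended across tasks through reusable skills and tools?
RQ4: How effective are OpenHands agents on software engineering, web browsing, and miscellaneous assistance benchmarks?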
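
RQ2 concerns how an action becomes an observation. The sketch below illustrates the general shape of that translation under strong simplifying assumptions: it runs commands and Python snippets directly on the local machine with subprocess and exec, whereas the platform described here executes them inside a Docker sandbox. The names CmdRun, PythonCell, Observation, and execute are hypothetical stand-ins, loosely mirroring the CmdRunAction and IPythonRunCellAction primitives named above, and are not the project's real classes.

```python
import contextlib
import io
import subprocess
from dataclasses import dataclass

# Hypothetical stand-ins for illustration only. This code offers NO isolation;
# the platform summarized here runs such actions inside a Docker sandbox.


@dataclass
class CmdRun:
    """A shell command to execute."""
    command: str


@dataclass
class PythonCell:
    """A snippet of Python to execute in a persistent namespace."""
    code: str


@dataclass
class Observation:
    """Captured output plus a status code (0 means success)."""
    content: str
    exit_code: int = 0


def execute(action, namespace: dict) -> Observation:
    """Turn one action into one observation."""
    if isinstance(action, CmdRun):
        proc = subprocess.run(
            action.command, shell=True, capture_output=True, text=True
        )
        return Observation((proc.stdout + proc.stderr).strip(), proc.returncode)

    if isinstance(action, PythonCell):
        buffer = io.StringIO()
        try:
            # The namespace persists between cells, like a notebook kernel.
            with contextlib.redirect_stdout(buffer):
                exec(action.code, namespace)
        except Exception as exc:
            return Observation(f"{type(exc).__name__}: {exc}", 1)
        return Observation(buffer.getvalue().strip())

    raise TypeError(f"unsupported action: {type(action).__name__}")
```

Because the namespace dictionary is passed in by the caller, a variable defined in one PythonCell stays visible to the next, which is the behaviour an agent relies on when it builds up a solution over several steps. Errors are returned as observations rather than raised, so the agent can read the failure and try again.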

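Key Findings

OpenHands achieves competitive performance on software engineering, web browsing, and other benchmarks using both generalist and specialist agents.
On HumanEvalFix (Python subset), the OpenHands CodeActAgent reaches a 79.3% bug-fixing success rate, outperforming many non-agentic baselines.
Web browsing results show that the OH BrowsingAgent and delegation configurations are competitive; for example, on MiniWoB++, OH BrowsingAgent v1.0 reaches a 27.2% success rate with gpt-4o-mini-2024-07-18 and 40.8% with gpt-4o-2024-05-13.
On other benchmarks, OH CodeActAgent v1.8 with Claude-3-5-Sonnet reaches 52.0% on GPQA, and OH CodeActAgent v1.5 with gpt-4o-2024-05-13 reaches 77.3% on the MINT code subset.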

This digest was generated by AI and reviewed by a human editor.