QUICK REVIEW

[Paper Review] OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

Zhiyong Wu, Chengcheng Han|arXiv (Cornell University)|Feb 12, 2024

Multi-Agent Systems and Negotiation | Cited by 5

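One-Sentence Summary

OS-Copilot is an operating-system-level framework for generalist agents, and it introduces FRIDAY, an embodied agent that improves itself through self-directed curriculum learning, learns to control unseen applications, and achieves a marked improvement on the GAIA benchmark.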

ABSTRACT

Autonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific software or website. This narrow focus constrains their applicability for general computer tasks. To this end, we introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks. We also present numerical and quantitative evidence that FRIDAY learns to control and self-improve on Excel and Powerpoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.

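Research Motivation and Goals

Motivate the development of generalist computer agents that can interact with diverse operating-system components (web, files, terminals, applications).
Propose a unified OS interaction interface and a memory-based configurator to enable generalization across applications.
Present a self-improving embodied agent (FRIDAY) that learns to control unfamiliar applications through a self-directed curriculum.
Demonstrate FRIDAY's performance on GAIA and its capacity for self-directed learning on applications such as Excel and PowerPoint.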

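Proposed Method

Introduces OS-Copilot as a general OS interaction framework that integrates a Python interpreter, bash, mouse/keyboard control, and API calls.
Defines a planner, a configurator, and an actor, together with execution, critic, and memory modules, to decompose tasks and collect feedback.
Uses a directed acyclic graph (DAG) planner to model task dependencies and allow independent subtasks to run in parallel.
Implements declarative memory (user profile, semantic knowledge) and procedural memory (a tool repository) for long-term knowledge and skills.
In FRIDAY, adds a self-directed learning module: the agent proposes a set of tasks for an unfamiliar application and accumulates tools by solving them.
Evaluates FRIDAY on GAIA, including an ablation (FRIDAY without learning) and comparisons with baselines such as AutoGPT-4 and GPT-4 Plugins.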
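
The DAG-based planning idea can be illustrated with a minimal sketch. This is not the OS-Copilot implementation. It only shows, under stated assumptions, how subtasks with dependencies can be grouped into batches whose members could run in parallel. The function name `plan_batches` and the subtask names are hypothetical, and the sketch relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def plan_batches(dependencies):
    """Group subtasks into ordered batches.

    `dependencies` maps each subtask to the set of subtasks it depends on.
    Every subtask in a batch has all of its dependencies satisfied by
    earlier batches, so the members of one batch are mutually independent.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


# Hypothetical decomposition of "summarize a spreadsheet on a slide".
subtasks = {
    "read_sheet": set(),
    "fetch_template": set(),
    "compute_totals": {"read_sheet"},
    "build_slide": {"compute_totals", "fetch_template"},
}

batches = plan_batches(subtasks)
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

In this toy graph, `read_sheet` and `fetch_template` share the first batch because neither depends on the other, while `build_slide` waits until both of its prerequisites are done.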

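Experimental Results

Research Questions

RQ1: Can an OS-level language agent generalize to a broad range of applications beyond the web and the terminal?
RQ2: Does the self-improvement loop (planning, execution, critique, refinement, and learning) improve performance on open-world OS tasks?
RQ3: How does self-directed learning yield new tools and capabilities for unseen applications?
RQ4: How do FRIDAY's performance and generalization on GAIA compare with existing systems?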

Key Findings

Model	Level 1	Level 2	Level 3	Human*
GPT-4	9.68	1.89	0
GPT-4-Turbo	9.68	6.92	0
AutoGPT-4	15.05	0.63	0
GPT-4 Plugins	30.30	9.70	0
FRIDAY w/o learning	36.56	17.61	6.12
FRIDAY	40.86	20.13	6.12

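FRIDAY reaches a 40.86% success rate on GAIA Level 1 tasks, a relative improvement of about 35% over the previous best system (30.3%).
FRIDAY reaches 20.13% on GAIA Level 2 tasks and 6.12% on Level 3, surpassing several baselines.
FRIDAY already outperforms the baselines without learning, which indicates that the architecture itself is effective, and self-directed learning improves performance further.
On the spreadsheet task dataset used for self-directed learning, FRIDAY reaches a 60% success rate, exceeding the SheetCopilot baseline.
FRIDAY learns to control Excel and PowerPoint with minimal supervision and accumulates tools autonomously.
The framework highlights the importance of the planner, critic, and refiner in achieving advanced generalization that goes beyond the sheer number of tools.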
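
The relative improvement of about 35% on Level 1 can be checked directly from the table values. The short sketch below is only a worked calculation. The helper name `relative_improvement` is hypothetical, and the numbers are the Level 1 scores listed above for FRIDAY, FRIDAY w/o learning, and GPT-4 Plugins.

```python
def relative_improvement(new, baseline):
    """Return the relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Level 1 success rates (%) taken from the results table.
friday_l1 = 40.86
friday_no_learning_l1 = 36.56
best_prior_l1 = 30.30  # GPT-4 Plugins

gain = relative_improvement(friday_l1, best_prior_l1)
learning_gain_points = friday_l1 - friday_no_learning_l1

print(f"Relative gain over best prior system: {gain:.1f}%")
print(f"Absolute gain from learning: {learning_gain_points:.2f} points")
```

The absolute gap to the previous best system is 10.56 percentage points, which corresponds to roughly 35% in relative terms. Self-directed learning accounts for 4.30 of those points on Level 1.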

This review was generated by AI and reviewed by a human editor.