QUICK REVIEW

[Paper Review] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Samuel Miserendino, Michele Wang|ArXiv.org|Feb 17, 2025

FinTech, Crowdfunding, Digital Finance | Citations: 3

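One-sentence summary

SWE-Lancer is a benchmark built on 1,488 real Upwork software engineering tasks worth $1 million in total payouts. It uses end-to-end tests to evaluate frontier LLMs on independent coding tasks and compares their managerial decisions against those of the original engineering managers. The results show that current models fall far short of earning the full payout.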

ABSTRACT

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fixes to $32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

Motivation and objectives

Measure frontier LLM capability on real-world freelance SWE tasks with real monetary payouts.
Evaluate both Independent Contributor SWE tasks and SWE Manager tasks requiring proposal evaluation.
Provide end-to-end, triple-verified testing to assess full-stack engineering performance.
Open-source a unified evaluation environment and a public evaluation split for reproducibility and research growth.

Proposed method

Assemble 1,488 real Upwork tasks from Expensify with real payouts totaling $1M.
Split tasks into IC SWE (764 tasks) and SWE Manager (724 tasks) categories.
Use end-to-end Playwright-based tests for IC tasks, triple-verified by engineers.
Have models operate in a restricted Docker environment with local codebase access and no Internet; models receive a single pass (pass@1) unless noted.
Enable a user tool that lets models browse and interact with the local application, with outputs logged for evaluation.
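
The scoring idea behind the benchmark, where a task's payout counts only if the model's answer is judged correct, can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the `TaskResult` type, its field names, and the sample payouts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task attempt (hypothetical structure, for illustration only)."""
    task_id: str
    kind: str          # "ic_swe" or "swe_manager"
    payout_usd: float  # payout attached to the task
    passed: bool       # IC: end-to-end tests pass; manager: pick matches the original manager's


def summarize(results):
    """Return (pass rate, dollars earned, earn rate) for a list of TaskResult."""
    if not results:
        return 0.0, 0.0, 0.0
    total = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.passed)
    pass_rate = sum(r.passed for r in results) / len(results)
    earn_rate = earned / total if total else 0.0
    return pass_rate, earned, earn_rate


# Made-up sample data: two IC SWE tasks and two SWE Manager tasks.
results = [
    TaskResult("t1", "ic_swe", 50.0, True),
    TaskResult("t2", "ic_swe", 1000.0, False),
    TaskResult("t3", "swe_manager", 250.0, True),
    TaskResult("t4", "swe_manager", 700.0, False),
]
pass_rate, earned, earn_rate = summarize(results)
print(f"pass rate {pass_rate:.1%}, earned ${earned:,.0f}, earn rate {earn_rate:.1%}")
```

In this made-up sample, half the tasks pass but only 15% of the available money is earned, because the two solved tasks are the cheap ones. That gap is why a payout-weighted metric can tell a different story from a plain pass rate.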

Experimental results

Research questions

RQ1: Can frontier LLMs autonomously fix real-world software bugs and implement features in a full-stack codebase?
RQ2: Can models effectively act as SWE managers by selecting the best implementation proposals among real freelancer submissions?
RQ3: How does model performance map to actual Upwork payouts across varying task difficulties and domains?
RQ4: What is the impact of tool use and test-time compute on model success rates in complex SWE tasks?

Key findings

The best-performing model (Claude 3.5 Sonnet) earns $208k on SWE-Lancer Diamond (26.2% of IC SWE tasks solved) and over $400k on the full dataset; however, the majority of its solutions are incorrect.
Across IC SWE tasks, pass@1 and earnings rates stay below 30% for all models; SWE Manager tasks show higher success rates, with Claude 3.5 Sonnet achieving 44.9%–45.0% on Diamond Manager tasks.
Increasing the number of attempts (pass@k) or test-time compute improves pass rates, especially for stronger models; e.g., Claude 3.5 Sonnet reaches 26.2% IC Diamond pass@1, and higher reasoning effort lifts IC pass@1 from 9.3% to 16.5%.
Using a user tool enables better performance for top models; removing the tool reduces performance more for stronger models, indicating effective tool use is key to success.
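
For readers unfamiliar with pass@k, the metric can be sketched as below. The review does not say how pass@k was computed, so this uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) as an assumption; the function name and sample numbers are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled attempts is correct.

    n: total attempts sampled for a task
    c: number of those attempts that were correct
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that all k drawn attempts are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 5 attempts sampled, 1 of them correct.
for k in (1, 3, 5):
    print(f"pass@{k} = {pass_at_k(5, 1, k):.2f}")
```

With k = 1 the estimator reduces to the fraction of correct attempts (c / n), which is the pass@1 figure quoted throughout the findings.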


This review was generated by AI and checked by a human editor.