QUICK REVIEW

[Paper Review] Beyond BeautifulSoup: Benchmarking LLM-Powered Web Scraping for Everyday Users

Arth Bhardwaj, Nirav Diwan|arXiv (Cornell University)|Jan 9, 2026
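Web Application Security Vulnerabilities | Cited by 0
One-Sentence Summary

The paper benchmarks traditional scraping tools against LLM-based agents, evaluates what non-expert users can achieve across 35 sites, compares the LAS and ELA workflows, and highlights the trade-offs among speed, reliability, and complexity.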

ABSTRACT

Web scraping has historically required technical expertise in HTML parsing, session management, and authentication circumvention, which limited large-scale data extraction to skilled developers. We argue that large language models (LLMs) have democratized web scraping, enabling low-skill users to execute sophisticated operations through simple natural language prompts. While extensive benchmarks evaluate these tools under optimal expert conditions, we show that without extensive manual effort, current LLM-based workflows allow novice users to scrape complex websites that would otherwise be inaccessible. We systematically benchmark what everyday users can do with off-the-shelf LLM tools across 35 sites spanning five security tiers, including authentication, anti-bot, and CAPTCHA controls. We devise and evaluate two distinct workflows: (a) LLM-assisted scripting, where users prompt LLMs to generate traditional scraping code but maintain manual execution control, and (b) end-to-end LLM agents, which autonomously navigate and extract data through integrated tool use. Our results demonstrate that end-to-end agents have made complex scraping accessible - requiring as little as a single prompt with minimal refinement (less than 5 changes) to complete workflows. We also highlight scenarios where LLM-assisted scripting may be simpler and faster for static sites. In light of these findings, we provide simple procedures for novices to use these workflows and gauge what adversaries could achieve using these.

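Research Motivation and Goals

  • Assess how far web scraping has been democratized for novices using off-the-shelf tools.
  • Evaluate two workflows, LLM-assisted scripting (LAS) and end-to-end LLM agents (ELA), across sites with diverse protections.
  • Quantify usability and reliability through extraction success rate, execution time, and manual effort.
  • Offer practical guidance on when to use each workflow and identify potential misuse risks.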

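Proposed Method

  • Define a benchmark of 35 websites spanning five security tiers.
  • Compare traditional scraping (BeautifulSoup, Scrapy) with end-to-end LLM agents (Claude, Simular.ai).
  • Use fixed prompts and a standardized evaluation environment to measure success, time, and manual effort.
  • Evaluate LAS by prompting an LLM to generate code that the user runs; evaluate ELA through agent-driven navigation and extraction.
  • Record outcomes for access, extraction, and CSV-based data output, with up to three trials per site.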
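The measurement procedure above can be sketched as a small aggregation over trial records. This is only an illustration: the record fields, site names, and timings are hypothetical, and treating the success rate as the fraction of successful trials per category is an assumption, since the summary does not state the exact formula.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial log: one record per attempt, at most three attempts per site.
# Site names, categories, and timings are invented for illustration only.
TRIALS = [
    {"site": "site-a", "category": "Simple HTML", "success": True, "seconds": 1.4},
    {"site": "site-a", "category": "Simple HTML", "success": True, "seconds": 1.6},
    {"site": "site-b", "category": "CAPTCHA", "success": False, "seconds": 42.0},
    {"site": "site-b", "category": "CAPTCHA", "success": False, "seconds": 38.0},
    {"site": "site-b", "category": "CAPTCHA", "success": True, "seconds": 40.0},
]

MAX_TRIALS_PER_SITE = 3  # the summary mentions up to three trials per site


def summarize(trials):
    """Return {category: (success_rate, mean_seconds)} over all recorded trials."""
    trials_per_site = defaultdict(int)
    by_category = defaultdict(list)
    for trial in trials:
        trials_per_site[trial["site"]] += 1
        if trials_per_site[trial["site"]] > MAX_TRIALS_PER_SITE:
            raise ValueError(f"too many trials recorded for {trial['site']}")
        by_category[trial["category"]].append(trial)
    return {
        category: (
            # Assumed definition: share of trials that succeeded.
            mean(1.0 if t["success"] else 0.0 for t in rows),
            mean(t["seconds"] for t in rows),
        )
        for category, rows in by_category.items()
    }


summary = summarize(TRIALS)
```

Keeping success and timing in the same record makes it straightforward to report both per category, which is how the results are presented later in the review.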
Figure 1: Benchmark for non-expert web scraping. We introduce a benchmark that evaluates what non-expert users can achieve with off-the-shelf tools, modeling two workflows: (i) LLM-assisted Scripting (LAS) and (ii) End-to-End LLM Agent (ELA). LAS: the LLM drafts code that the user executes and man
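To make the LAS workflow concrete, here is a minimal sketch of the kind of script an LLM might draft for a static page and a user would then run by hand. It is not taken from the paper. The traditional tools named in the review are BeautifulSoup and Scrapy; this sketch swaps in Python's standard-library `html.parser` so that it runs without extra dependencies. The sample HTML and the `TableParser` and `html_table_to_csv` names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Inline sample page so the sketch runs offline; a real run would fetch the HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def html_table_to_csv(html):
    """Parse the table rows out of `html` and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

A script like this only sees the HTML it is handed, so it has no answer to logins, anti-bot checks, or CAPTCHA. Those are the categories where the review reports that traditional tools are not supported.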

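Experimental Results

Research Questions

  • RQ1: What can non-expert users realistically achieve with off-the-shelf scraping tools?
  • RQ2: How do LAS and ELA compare in success rate, speed, and required human effort across site difficulty tiers?
  • RQ3: Are LLM agents more effective than traditional tools on sites with authentication, anti-bot measures, or CAPTCHA?
  • RQ4: For static versus dynamic sites, when does agent-driven automation have the advantage over scripting?
  • RQ5: What are the defensive implications of democratized LLM-based web scraping?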

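Key Findings

| Category | BeautifulSoup | Scrapy | Claude | Simular.ai |
|---|---|---|---|---|
| Simple HTML | 0.93 | 0.82 | 1.00 | 1.00 |
| Complex HTML | 0.80 | 0.20 | 0.57 | 1.00 |
| Simple authentication | Not supported | Not supported | 0.20 | 0.63 |
| Complex authentication | Not supported | Not supported | 0.12 | 0.70 |
| CAPTCHA | Not supported | Not supported | 0.05 | 0.10 |

  • End-to-end LLM agents significantly outperform scripts on complex and protected sites, gaining access where traditional tools fail.
  • Simular.ai achieves the highest overall performance, with a perfect ESR on both simple and complex HTML. On authenticated pages it remains the strongest tool (0.63 and 0.70), but success is far from universal, and it falls to 0.10 on CAPTCHA sites.
  • For static HTML, traditional tools remain faster and highly effective; LAS reaches a high ESR in under 2 seconds on simple use cases.
  • On CAPTCHA- and MFA-heavy sites, traditional tools fail or demand very high effort, whereas LLM agents remain viable, though with slower runtimes (tens of seconds) and more retries.
  • There is a clear division of strengths: LAS suits static extraction, while ELA suits complex, dynamic, and protected content.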
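The division of strengths can be read directly off the table above. The sketch below encodes those reported scores and applies a simple rule of thumb. It rests on several assumptions: the scores are treated as ESR values, BeautifulSoup and Scrapy are treated as the script-based (LAS-style) tools, and the 0.9 threshold and function names are hypothetical choices for illustration, not something the paper prescribes.

```python
# Scores copied from the table above; None marks "not supported".
SCORES = {
    "Simple HTML": {"BeautifulSoup": 0.93, "Scrapy": 0.82, "Claude": 1.00, "Simular.ai": 1.00},
    "Complex HTML": {"BeautifulSoup": 0.80, "Scrapy": 0.20, "Claude": 0.57, "Simular.ai": 1.00},
    "Simple authentication": {"BeautifulSoup": None, "Scrapy": None, "Claude": 0.20, "Simular.ai": 0.63},
    "Complex authentication": {"BeautifulSoup": None, "Scrapy": None, "Claude": 0.12, "Simular.ai": 0.70},
    "CAPTCHA": {"BeautifulSoup": None, "Scrapy": None, "Claude": 0.05, "Simular.ai": 0.10},
}

# Assumption: these two are the script-based tools; the others are agents.
SCRIPT_TOOLS = ("BeautifulSoup", "Scrapy")


def best_tools(category):
    """Return the tool(s) with the highest reported score, sorted by name."""
    supported = {tool: s for tool, s in SCORES[category].items() if s is not None}
    top = max(supported.values())
    return sorted(tool for tool, s in supported.items() if s == top)


def suggest_workflow(category, min_script_score=0.9):
    """Suggest "LAS" when a script-based tool clears the (hypothetical) threshold."""
    script_scores = [
        SCORES[category][tool]
        for tool in SCRIPT_TOOLS
        if SCORES[category][tool] is not None
    ]
    if script_scores and max(script_scores) >= min_script_score:
        return "LAS"
    return "ELA"


for category in SCORES:
    print(category, best_tools(category), suggest_workflow(category))
```

The rule ignores execution time and manual effort, which the review also treats as deciding factors, so it is a starting point rather than a full selection procedure.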
Figure 2: Average execution time per category. Note that the y-axis is on a log scale (in seconds).


This review was generated by AI and reviewed by human editors.