QUICK REVIEW

[Paper Review] WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements

Xiwen Teoh, Yun Lin | arXiv (Cornell University) | Feb 12, 2026

Software Testing and Debugging Techniques | Citations: 0

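Summary in Brief

WebTestPilot is an LLM-based agent that symbolizes critical GUI elements and infers implicit oracles over those symbols from natural language requirements, which makes end-to-end web testing and bug detection robust to diverse NL inputs. It generalizes across models and to a real-world no-code deployment, with high task-completion and bug-detection scores.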

ABSTRACT

Visual language model (VLM) agents show great promise in automating end-to-end (E2E) web testing against requirements in natural language. However, the probabilistic nature of language models can have inherent hallucinations. Therefore, given a detected inconsistency between the requirement and the web application, it is hard to distinguish whether it stems from the hallucination or a real application bug. Addressing this issue presents two core technical challenges: the implicit oracle inference challenge, where the agent must act as its own oracle to implicitly decide if the application's behavior is correct without guidance, and the probabilistic inference challenge, where an LLM's inconsistent reasoning undermines its trustworthiness as an oracle. Existing LLM-based approaches fail to capture such implicit oracles, either by treating any page navigation that doesn't crash as a success, or by checking each state in isolation, thus missing bugs dependent on context from prior steps. We introduce WebTestPilot, an LLM-based agent designed to address these challenges. WebTestPilot uses (1) a symbolization layer which detects and symbolizes critical GUI elements on the web application into symbols (i.e., variables) and (2) translates natural language specification into a sequence of steps, each of which is equipped with inferred pre- and post-conditions over the symbols as an oracle. This oracle captures data, temporal, and causal dependencies, enabling the validation of implicit requirements. To advance research in this area, we build a benchmark of bug-injected web apps for evaluating NL-to-E2E testing. The results show that WebTestPilot achieves a task completion rate of 99%, with 96% precision and 96% recall in bug detection, outperforming the best baseline (+70 precision, +27 recall). The agent generalizes across diverse natural language inputs and model scales.
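The idea of an oracle built from pre- and post-conditions over symbols can be illustrated with a minimal Python sketch. It is not the paper's DSL: the `Cart` symbol, the `Step` class, and `check_step` are hypothetical names chosen for illustration, and the symbol values are supplied by hand rather than extracted from a page. The point it shows is that a post-condition can relate the state before and after an action, so a bug that looks plausible when a single page is checked in isolation is still caught.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Cart:
    """Hypothetical symbol: the cart state extracted from the page."""
    items: Tuple[str, ...]
    total: float


@dataclass
class Step:
    """One test step with an oracle expressed over symbols."""
    action: str
    pre: Callable[[Cart], bool]          # must hold before the action
    post: Callable[[Cart, Cart], bool]   # relates the state before and after


def check_step(step: Step, before: Cart, after: Cart) -> List[str]:
    """Return the violated conditions (an empty list means the step passed)."""
    violations = []
    if not step.pre(before):
        violations.append("precondition")
    if not step.post(before, after):
        violations.append("postcondition")
    return violations


# Illustrative step: add a 12.50 "Mug" to the cart.
add_mug = Step(
    action='click "Add to cart" on Mug',
    pre=lambda cart: "Mug" not in cart.items,
    post=lambda old, new: (
        new.items == old.items + ("Mug",)
        and abs(new.total - (old.total + 12.50)) < 1e-9
    ),
)

before = Cart(items=("Pen",), total=3.00)
correct_after = Cart(items=("Pen", "Mug"), total=15.50)
buggy_after = Cart(items=("Pen", "Mug"), total=3.00)  # total was not updated

print(check_step(add_mug, before, correct_after))  # []
print(check_step(add_mug, before, buggy_after))    # ['postcondition']
```

In the buggy case the page does not crash and the cart lists the right items, so only the data dependency on the earlier total exposes the defect.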

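Motivation and Objectives

Translate natural language requirements into executable end-to-end test steps and verifiable oracles.
Infer implicit state dependencies (causal, temporal, and data) to verify correctness across states.
Combine neural reasoning with a symbolic DSL to generate grounded, reusable assertions.
Evaluate on a new NL-to-E2E bug-detection benchmark and on a real-world no-code deployment.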

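Proposed Method

Parse the natural language requirement into an ordered sequence of steps, each with a condition, an action, and an expectation.
Use the symbolization layer to extract and instantiate domain-specific symbols (e.g., cart, product) as schemas.
Encode explicit and implicit requirements as pre- and post-conditions over the symbols in a Python-extended DSL.
Combine coarse element localization by a GUI grounding model with Set-of-Mark prompting to produce precise, executable actions.
Apply retries and optional majority voting to mitigate LLM reasoning errors in assertions.
Build a benchmark of bug-injected web apps and compare against LLM-based baselines (NaviQAte, LaVague, PinATA).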
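The retry and majority-voting step can be sketched as follows. This is an assumption about how such a mechanism might look, not the paper's implementation: `vote_on_assertion`, `scripted_judge`, the default vote count, and the rule that a tie counts as a failed assertion are all hypothetical. The `judge` callable stands in for one LLM evaluation of an assertion and returns `None` when its output is unusable.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def vote_on_assertion(
    judge: Callable[[], Optional[bool]],
    votes: int = 3,
    max_retries: int = 2,
) -> bool:
    """Sample `judge` up to `votes` times and return the majority verdict.

    Each vote is retried up to `max_retries` extra times while the judge
    returns None (an unusable answer). A tie is treated as a failed
    assertion, which is an assumption made for this sketch.
    """
    verdicts = []
    for _ in range(votes):
        for _attempt in range(max_retries + 1):
            verdict = judge()
            if verdict is not None:
                verdicts.append(verdict)
                break
    if not verdicts:
        raise RuntimeError("judge never produced a usable verdict")
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def scripted_judge(outputs: Iterable[Optional[bool]]) -> Callable[[], Optional[bool]]:
    """Deterministic stand-in for an LLM call, used only for demonstration."""
    it = iter(outputs)
    return lambda: next(it)


# Second vote needs one retry; the verdicts are True, False, True.
print(vote_on_assertion(scripted_judge([True, None, False, True])))  # True
```

Voting trades extra model calls for stability, which is presumably why the source describes it as optional.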

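Experimental Results

Research Questions

RQ1: How effectively does WebTestPilot generate test paths compared with baseline GUI testing agents?
RQ2: How effective is it at detecting visual and functional defects compared with existing agent-based baselines?
RQ3: How robust is WebTestPilot to varied, unstructured natural language requirements with no fixed input format?
RQ4: How do model size and refinement affect WebTestPilot's performance when it runs on lightweight LLMs?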

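Key Findings

Achieves 99% task completion on the NL-to-E2E testing benchmark.
Achieves 96% precision and 96% recall in bug detection.
Outperforms the strongest baseline on the benchmark by +70 in precision and +27 in recall.
Generalizes across diverse NL inputs and model scales, from 3B to 72B parameters.
Found 8 bugs on a deployed real-world no-code platform, including data-binding, UI, and navigation issues.
Built a benchmark of 110 injected bugs across four web applications, including bugs that involve data transformation.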
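For readers less familiar with the bug-detection metrics above, the following sketch shows how precision and recall are computed from detection counts. The counts are made up for illustration and are not taken from the paper, and the reading of a true positive as a reported bug that matches a real injected bug is an assumption about the evaluation setup.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute (precision, recall) from detection counts.

    tp: reported bugs that are real
    fp: reported bugs that are false alarms (e.g. caused by hallucination)
    fn: real bugs that were not reported
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only, not results from the paper.
p, r = precision_recall(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.80
```

Low precision means the agent often blames the application for its own mistakes, while low recall means real bugs slip through.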

This review was generated by AI and checked by a human editor.