QUICK REVIEW

[Paper Review] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

İzzeddin Gür, Hiroki Furuta|arXiv (Cornell University)|Jul 24, 2023

Web Data Mining and Analysis|Cited by 16

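One-Line Summary

WebAgent combines an HTML-specialized planning and summarization model (HTML-T5) with a grounded code-generation model (Flan-U-PaLM) to automate real-world websites through Python Selenium programs. It achieves a success rate more than 50% higher than the baselines on real websites, outperforms the prior method on MiniWoB++, and reaches state-of-the-art results on Mind2Web.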

ABSTRACT

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.

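Motivation and Objectives

Advance autonomous web automation so that it works on real websites with open-ended actions and long HTML documents.
Develop domain-specialized language models that handle planning, HTML summarization, and program generation under self-supervision.
Ground language plans into executable Python scripts for browser control, enabling end-to-end task completion.
Demonstrate better generalization and robustness than prior baselines on real websites and standard benchmarks.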

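Proposed Method

Introduces HTML-T5, an encoder-decoder model with local and global attention specialized for long HTML documents.
Pre-trains HTML-T5 on a large HTML corpus (CommonCrawl) with a mixture of long-span denoising objectives, then fine-tunes it with self-experience supervision.
Uses Flan-U-PaLM as a grounded code generator that translates sub-instructions and HTML snippets into executable Python Selenium scripts.
Fine-tunes HTML-T5 for planning and summarization using self-experience supervision, with demonstrations collected from interactions with real websites.
Integrates planning (HTML-T5) and program synthesis (Flan-U-PaLM) in a modular WebAgent architecture that handles open-ended actions and long HTML.
Evaluates on real websites (real estate, social media, maps) and on HTML benchmark tasks (MiniWoB++, Mind2Web).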
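The long-span denoising pre-training mentioned in the list above can be pictured with a small T5-style span-corruption sketch. This is an illustration only, not the authors' implementation: the function names, the 15% corruption rate, the candidate span lengths (8 and 64), and the evenly spaced span placement are all assumptions chosen for readability. The intuition is that longer masked spans are more likely to cover whole HTML elements rather than isolated tokens.

```python
import random


def corrupt_spans(tokens, mean_span_len, corruption_rate=0.15, seed=0):
    """Replace contiguous spans with sentinel tokens (T5-style span corruption).

    Hypothetical helper: rate, span lengths, and placement are illustrative.
    Returns (inputs, targets) as token lists.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, round(n * corruption_rate))
    num_spans = max(1, round(num_to_mask / mean_span_len))
    span_len = max(1, num_to_mask // num_spans)
    segment_len = n // num_spans  # one span per segment, so spans never overlap

    inputs, targets = [], []
    cursor = 0
    for i in range(num_spans):
        seg_start = i * segment_len
        start = rng.randint(seg_start, seg_start + segment_len - span_len)
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # the span collapses to one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:start + span_len])  # decoder must restore it
        cursor = start + span_len
    inputs.extend(tokens[cursor:])
    return inputs, targets


def sample_denoising_example(tokens, span_lengths=(8, 64), seed=0):
    """Mixture of objectives: pick one mean span length per example (assumed)."""
    rng = random.Random(seed)
    return corrupt_spans(tokens, rng.choice(span_lengths), seed=seed)
```

Calling `sample_denoising_example(html_tokens)` on a tokenized HTML document yields an encoder input with sentinels and a decoder target holding the masked spans; the original sequence can be rebuilt by substituting each sentinel with its span.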

Figure 1: Challenges in real-world web automation. Recent language model agents (Furuta et al., 2023; Gur et al., 2022; Kim et al., 2023; Yao et al., 2022b) can navigate simulated websites (Shi et al., 2017; Yao et al., 2022a), where the agents manipulate pre-defined actions and receive simpli

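Experimental Results

Research Questions

RQ1: How does combining modular, specialized language models improve real-world web automation compared with a single-LLM approach?
RQ2: Do HTML-focused planning and long-HTML summarization enable robust task grounding on real websites with long documents?
RQ3: How does self-experience supervision affect planning accuracy and overall task success?
RQ4: How does HTML-T5 compare with prior methods on standard HTML-based benchmarks (MiniWoB++, Mind2Web)?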

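Key Findings

WebAgent improves success on real-world web automation by over 50% compared with the baselines.
HTML-T5 achieves an 18.7% higher success rate than prior language model agents on MiniWoB++ and state-of-the-art results on Mind2Web.
With self-experience supervision, HTML-T5 produces better plans and HTML summaries, which raises overall task success.
On Mind2Web offline action prediction, the XL variant of HTML-T5 achieves SoTA across task, website, and domain generalization.
In the real-world evaluation, the best combination of planning and summarization modules (WebAgent) outperforms open-loop planning and regular-expression-based summarization.
Ablation studies show that adaptive sub-instruction planning and HTML-aware summarization are essential to success.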

Figure 3: WebAgent is a combination of LLMs: HTML-T5 for planning and summarization, and Flan-U-PaLM for grounded program synthesis. WebAgent can handle the bottlenecks in the real-world tasks: open domain action space, complex natural language instructions, and long HTML.
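The division of labor in the caption can be sketched as a closed loop: plan the next sub-instruction, summarize the current page into a snippet, synthesize a program, execute it, and repeat on the new page. The sketch below is a hedged illustration, not the paper's code. The class name `WebAgentSketch`, the callable signatures, the `"DONE"` stop signal, and the step limit are hypothetical, and the models and browser are replaced by plain Python stubs so the loop runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WebAgentSketch:
    # All four callables are assumed interfaces, not the authors' API.
    plan: Callable[[str, List[str], str], str]            # stands in for HTML-T5 planning
    summarize: Callable[[str, List[str], str, str], str]  # stands in for HTML-T5 summarization
    synthesize: Callable[[str, str], str]                 # stands in for Flan-U-PaLM code generation
    execute: Callable[[str], str]                         # runs a program, returns the new page HTML
    max_steps: int = 10

    def run(self, instruction: str, html: str) -> List[str]:
        """Return the list of sub-instructions that were executed."""
        history: List[str] = []
        for _ in range(self.max_steps):
            sub = self.plan(instruction, history, html)
            if sub == "DONE":  # hypothetical stop signal
                break
            snippet = self.summarize(instruction, history, sub, html)
            program = self.synthesize(sub, snippet)
            html = self.execute(program)  # re-plan on the updated page (closed loop)
            history.append(sub)
        return history


def demo():
    """Drive the loop with stubs; a real setup would call models and a browser."""
    steps = ["type 'studio' in the search box", "click the search button"]
    executed: List[str] = []

    def plan(instruction, history, html):
        return steps[len(history)] if len(history) < len(steps) else "DONE"

    def summarize(instruction, history, sub, html):
        return html[:80]  # stub: a real summarizer would select task-relevant elements

    def synthesize(sub, snippet):
        return f"# generated program for: {sub}"

    def execute(program):
        executed.append(program)
        return f"<html>page after step {len(executed)}</html>"

    agent = WebAgentSketch(plan, summarize, synthesize, execute)
    return agent.run("find a studio apartment", "<html>start</html>"), executed
```

In a real deployment the `execute` callable would run the generated Python program against a Selenium WebDriver session and return the resulting page source. Keeping the four roles behind separate callables mirrors the modular design the review describes, where each component can be swapped or ablated independently.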


This review was generated by AI and checked by a human editor.