QUICK REVIEW

[Paper Review] GPT-4V(ision) is a Generalist Web Agent, if Grounded

Boyuan Zheng, Boyu Gou|arXiv (Cornell University)|Jan 3, 2024

Topic Modeling|Citations: 8

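One-Line Summary

This paper proposes SeeAct, a generalist web agent that uses GPT-4V to understand rendered web pages and generate executable actions, paired with grounding strategies that map its textual plans to HTML elements. In online evaluation it completes about 50% of tasks under oracle grounding (51.1% per the abstract), but grounding remains the bottleneck.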

ABSTRACT

The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding still remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turn out to be not effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.

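Motivation and Objectives

Assess whether large multimodal models can serve as generalist web agents that complete tasks on real-world websites.
Develop SeeAct, which combines visual understanding of web pages with grounding of actions to HTML elements.
Evaluate grounding strategies and compare GPT-4V-based agents against text-only LLMs and fine-tuned models on Mind2Web.
Introduce an online evaluation on live websites to complement the offline benchmark.
Measure grounding accuracy and the gap to oracle grounding to guide future improvements.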

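Proposed Method

Define action generation as producing a textual description of the next browser action.
Perform grounding by converting that description into an executable browser event, the triple (e, o, v): target element, operation, and value.
Explore three grounding methods: Element Attributes, Textual Choices, and Image Annotation.
Use Mind2Web as the evaluation dataset, in both offline (cached websites) and online (live sites) settings.
Compare GPT-4V with text-only LLMs (GPT-4, GPT-3.5) and fine-tuned models (FLAN-T5, BLIP2-T5).
Present a Playwright-based online evaluation tool for live-web experiments.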
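The Textual Choices idea above can be sketched as a multiple-choice question over candidate elements. This is a minimal illustration, not SeeAct's actual code: the names `Action`, `build_choice_prompt`, and `parse_choice`, the prompt wording, and the `ANSWER: <letter>` reply format are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from string import ascii_uppercase
from typing import List, Optional


@dataclass
class Action:
    """One browser action as the triple (e, o, v). Hypothetical container."""
    element: str     # e: the target HTML element (here, its HTML snippet)
    operation: str   # o: for example "CLICK", "TYPE", or "SELECT"
    value: str = ""  # v: text to type or option to pick; empty for clicks


def build_choice_prompt(description: str, candidates: List[str]) -> str:
    """Turn a textual plan plus candidate elements into lettered options.

    zip() stops at 26 options, so this sketch assumes at most 26 candidates.
    """
    lines = [f"Planned action: {description}", "Which element does it target?"]
    for letter, html in zip(ascii_uppercase, candidates):
        lines.append(f"{letter}. {html}")
    lines.append("Reply as 'ANSWER: <letter>' or 'ANSWER: NONE'.")
    return "\n".join(lines)


def parse_choice(reply: str, candidates: List[str]) -> Optional[int]:
    """Return the index of the chosen candidate, or None if nothing valid."""
    match = re.search(r"ANSWER:\s*([A-Z])\b", reply.upper())
    if match is None:
        return None  # covers 'ANSWER: NONE' and malformed replies
    index = ord(match.group(1)) - ord("A")
    return index if index < len(candidates) else None


candidates = [
    '<button id="search">Search</button>',
    '<input id="from" placeholder="From">',
]
prompt = build_choice_prompt("Type 'Boston' into the departure field", candidates)
chosen = parse_choice("ANSWER: B", candidates)
action = Action(element=candidates[chosen], operation="TYPE", value="Boston")
```

In a real pipeline the reply would come from the multimodal model, and the resulting triple would be handed to the browser automation layer. Both steps are left out here.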

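Experimental Results

Research Questions

RQ1: Can GPT-4V, when grounded, act as an effective generalist web agent across diverse websites and tasks?
RQ2: On complex web page layouts, which grounding strategy most effectively maps textual action plans to HTML elements?
RQ3: How do online (live) and offline evaluation compare for web agents, and how can the differences be characterized?
RQ4: How large is the performance gap between oracle grounding and practical grounding, and can grounding that aligns HTML with visuals close it?

Key Findings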

Model	Cross-Task Ele.Acc	Cross-Task Op.F1	Cross-Task StepSR	Cross-Website Ele.Acc	Cross-Website Op.F1	Cross-Website StepSR	Cross-Domain Ele.Acc	Cross-Domain Op.F1	Cross-Domain StepSR
FLAN-T5 – Base	40.5	74.4	37.3	28.7	69.6	27.9	38.2	69.1	36.2
FLAN-T5 – Large	52.2	70.7	48.8	35.3	65.8	32.7	41.9	64.6	39.5
FLAN-T5 – XL	56.8	74.6	52.5	42.6	69.9	39.5	43.8	65.2	40.7
BLIP2-T5 – Base	39.5	74.9	36.1	34.0	70.8	32.2	38.2	72.8	37.5
BLIP2-T5 – Large	50.0	72.1	46.0	39.5	71.5	36.3	40.9	70.1	39.4
BLIP2-T5 – XL	52.9	74.9	50.3	41.7	74.1	38.3	43.8	73.4	39.6
GPT-3.5	19.4	59.8	16.8	14.9	56.5	14.1	25.5	57.9	24.2
GPT-4	40.2	63.4	31.7	27.4	61.0	27.0	36.2	61.9	29.7
SeeAct – Choice	48.9	69.1	40.6	48.5	70.6	41.7	44.0	70.9	40.9
SeeAct – Oracle	72.9	80.9	65.7	74.4	83.7	70.0	72.8	73.6	62.1
SeeAct – Attributes	4.7	39.5	4.7	9.7	37.8	9.7	16.0	41.4	15.3
SeeAct – Annotation	15.1	66.5	13.0	11.3	63.4	10.5	16.5	65.1	14.7
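The review does not define the column abbreviations. Assuming they follow the usual Mind2Web convention (Ele.Acc: the predicted element is one of the acceptable elements; Op.F1: token-level F1 of the predicted operation string; StepSR: a step counts only if both element and operation are right), a per-step scorer might look like the sketch below. The function names and the string format of operations are hypothetical.

```python
from collections import Counter
from typing import Iterable


def token_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between two whitespace-tokenized operation strings."""
    pred_tokens, gold_tokens = predicted.split(), gold.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def step_success(pred_element: str, gold_elements: Iterable[str],
                 pred_op: str, gold_op: str) -> bool:
    """A step succeeds only if the element is acceptable and the operation
    matches perfectly (assumed here to mean F1 == 1.0)."""
    return pred_element in set(gold_elements) and token_f1(pred_op, gold_op) == 1.0


# One correct element with a wrong typed value: element hit, step failure.
f1 = token_f1("TYPE NYC", "TYPE Boston")
ok = step_success("input#from", ["input#from"], "TYPE NYC", "TYPE Boston")
```

Averaging these per-step values over a test split would give numbers on the same scale as the table, under the stated assumptions.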

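With oracle grounding, SeeAct (GPT-4V) reaches a step success rate of 62.1% to 70.0% and an element accuracy of 72.8% to 74.4% across the three splits, ahead of every baseline in the table.
Grounding remains the biggest bottleneck. The best practical strategy (Textual Choices, Step SR 40.6 to 41.7) far outperforms image-annotation grounding (Step SR 10.5 to 14.7) and approaches supervised fine-tuning.
In offline evaluation, SeeAct Choice consistently beats text-only GPT-4 in the cross-task, cross-website, and cross-domain settings.
In online evaluation, SeeAct Choice completes more whole tasks than GPT-4 or FLAN-T5-XL, and with oracle grounding the completion rate rises to about 50% (51.1% per the abstract).
In-context learning (ICL) tends to generalize better to unseen websites than supervised fine-tuning, which points to an advantage of large models' in-context abilities.
The gap between practical and oracle grounding is roughly 20-25 percentage points (in the offline table, SeeAct Oracle leads SeeAct Choice by 21.2 to 28.3 points of Step SR), which marks grounding as the core challenge for web agents.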
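The size of the oracle gap can be checked directly from the offline table. The short script below copies the StepSR values of the SeeAct Oracle and SeeAct Choice rows and subtracts them per split. The variable names are illustrative only.

```python
# StepSR values (in %) copied from the SeeAct Oracle and SeeAct Choice rows.
STEP_SR = {
    "Cross-Task": {"oracle": 65.7, "choice": 40.6},
    "Cross-Website": {"oracle": 70.0, "choice": 41.7},
    "Cross-Domain": {"oracle": 62.1, "choice": 40.9},
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gaps = {
    split: round(scores["oracle"] - scores["choice"], 1)
    for split, scores in STEP_SR.items()
}

for split, gap in gaps.items():
    print(f"{split}: oracle leads Textual Choices by {gap} points of StepSR")
```

The per-split gaps differ noticeably, so a single headline figure is best read as an approximation.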

This review was generated by AI and checked by a human editor.