QUICK REVIEW

[Paper Review] WebNavigator: Global Web Navigation via Interaction Graph Retrieval

Xuanwang Zhang, Yuteng Han | arXiv (Cornell University) | Mar 20, 2026

Multimodal Machine Learning Applications | Cited by 0

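One-Sentence Summary

WebNavigator reframes web navigation as a deterministic Retrieve-Reason-Teleport workflow over a pre-built Interaction Graph, achieving state-of-the-art results on WebArena and Online-Mind2Web with a compact 6-action interface.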

ABSTRACT

Despite significant advances in autonomous web navigation, current methods remain far from human-level performance in complex web environments. We argue that this limitation stems from Topological Blindness, where agents are forced to explore via trial-and-error without access to the global topological structure of the environment. To overcome this limitation, we introduce WebNavigator, which reframes web navigation from probabilistic exploration into deterministic retrieval and pathfinding. WebNavigator constructs Interaction Graphs via zero-token cost heuristic exploration offline and implements a Retrieve-Reason-Teleport workflow for global navigation online. WebNavigator achieves state-of-the-art performance on WebArena and Online-Mind2Web. On WebArena multi-site tasks, WebNavigator achieves a 72.9% success rate, more than doubling the performance of enterprise-level agents. This work reveals that Topological Blindness, rather than model reasoning capabilities alone, is an underestimated bottleneck in autonomous web navigation.

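Research Motivation and Goals

Motivate the shift from reactive, trial-and-error web navigation to global planning over a persistent environment graph.
Propose offline Interaction Graph construction that captures site topology without relying on a large language model.
Introduce online retrieval-augmented navigation with a Retrieve-Reason-Teleport workflow for deterministic navigation.
Demonstrate state-of-the-art performance on WebArena and Online-Mind2Web, highlighting a reduced action space and improved cross-site generalization.
Provide empirical evidence that Topological Blindness is a fundamental bottleneck and that environment knowledge substantially improves planning efficiency.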

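Proposed Method

Build the Interaction Graph G through offline heuristic auto-exploration, which interacts with dynamic elements and captures multimodal observations (screenshots and structured metadata).
Embed all nodes and index them in a vector database, so that retrieval during online navigation requires no LLM calls.
At inference time, use a three-stage global-view navigator: Retrieve the top-k candidate nodes via multimodal retrieval, Reason over them with a multimodal LLM to select the best candidate, and Teleport by computing the shortest path to the target node in G.
Use a unified navigate(domain, query) action to encapsulate planning, domain switching, and low-level browser state management.
Retrieve with late-interaction, token-level embedding similarity to preserve fine-grained alignment between queries and observations.
Show that a compact 6-action interface combined with global graph traversal achieves deterministic, globally optimal navigation, in contrast to purely reactive baselines.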
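
The three stages above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy graph, the keyword sets standing in for indexed multimodal observations, and the function names retrieve, reason, teleport, and navigate_sketch are all hypothetical. Word overlap replaces the vector search, a trivial rule replaces the multimodal LLM, and breadth-first search is used as one simple way to get a shortest path in an unweighted graph.

```python
from collections import deque

# Hypothetical toy interaction graph: node -> {action label: next node}.
# A real graph would come from offline exploration; this one is made up.
GRAPH = {
    "home": {"click:Orders": "orders", "click:Customers": "customers"},
    "orders": {"click:Order #1": "order_detail"},
    "customers": {"click:Alice": "customer_detail"},
    "order_detail": {},
    "customer_detail": {"click:Edit address": "address_form"},
    "address_form": {},
}

# Hypothetical keyword sets standing in for indexed multimodal observations.
NODE_TEXT = {
    "home": {"dashboard", "home"},
    "orders": {"orders", "list"},
    "customers": {"customer", "list"},
    "order_detail": {"order", "detail"},
    "customer_detail": {"customer", "detail", "address"},
    "address_form": {"edit", "customer", "address", "form"},
}


def retrieve(query, k=2):
    """Retrieve: rank nodes by word overlap with the query (stand-in for vector search)."""
    words = set(query.lower().split())
    ranked = sorted(NODE_TEXT, key=lambda n: (-len(words & NODE_TEXT[n]), n))
    return ranked[:k]


def reason(query, candidates):
    """Reason: placeholder for the multimodal LLM choice; takes the top-ranked candidate."""
    return candidates[0]


def teleport(start, goal):
    """Teleport: breadth-first search for the shortest action sequence, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for action, nxt in GRAPH[node].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


def navigate_sketch(query, current="home", k=2):
    """Chain the three stages and return (target node, action sequence)."""
    target = reason(query, retrieve(query, k))
    return target, teleport(current, target)


if __name__ == "__main__":
    print(navigate_sketch("edit customer address"))
```

In this toy setting the query "edit customer address" resolves to the address_form node and a three-click path from home. The point of the sketch is only that, once a target node is chosen, the path is computed from the graph rather than discovered by trial and error.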

Figure 1: Overview of WebNavigator. WebNavigator resolves Topological Blindness via a two-phase paradigm. (1) Offline Interaction Graph Construction. A heuristic auto-exploration engine discovers dynamic page observations at zero-token cost and indexes all observations into a vector database. (2) O

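Experimental Results

Research Questions

RQ1: Can a compact, offline-built Interaction Graph capture enough global structure to enable deterministic web navigation?
RQ2: Does moving navigation from probabilistic exploration to Retrieve-Reason-Teleport over a graph mitigate Topological Blindness across sites?
RQ3: How do knowledge completeness and multimodal retrieval bandwidth affect navigation success?
RQ4: How does late-interaction retrieval compare with dense embeddings in retrieval quality and navigation performance?
RQ5: Is a unified, domain-agnostic navigate(domain, query) interface sufficient to generalize across multiple websites?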

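Key Findings

WebNavigator achieves state-of-the-art performance on WebArena and Online-Mind2Web; on multi-site tasks it reaches 72.9% with Gemini-2.5-Pro, compared against the enterprise-level agent CUGA (the abstract reports this as more than double the performance of enterprise-level agents).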
On WebArena multi-site tasks, WebNavigator reaches a 50.0% success rate with GPT-4o and 63.3% with Gemini-2.5-Pro, significantly surpassing prior methods.
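Across the 136 real-world websites of Online-Mind2Web, WebNavigator reaches 52.7% with Gemini-2.5-Pro, indicating strong generalization.
The method uses a six-action interface (including navigate(domain, query)) and the Retrieve-Reason-Teleport workflow, turning navigation into deterministic pathfinding over the Interaction Graph.
Late-interaction (token-level) retrieval outperforms retrieval with dense embeddings, underscoring the importance of fine-grained visual-semantic matching.
Ablation studies show that the completeness of environment knowledge (exploration depth) and the information bandwidth (k) significantly affect performance, with diminishing returns once exploration is sufficient and a robust 6-action design is in place.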
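
To make the late-interaction versus dense-embedding comparison concrete, here is a minimal Python sketch. It assumes the common MaxSim form of late interaction (for each query token, take the best-matching document token, then sum), which the summary does not confirm is the exact scoring used. The mean-pooled cosine baseline, the 2-D toy vectors, and the function names are likewise hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim: sum over query tokens of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def dense_score(query_tokens, doc_tokens):
    """Single-vector baseline: cosine similarity of mean-pooled token embeddings."""
    def pool(tokens):
        dim = len(tokens[0])
        return normalize([sum(t[i] for t in tokens) / len(tokens) for i in range(dim)])
    return dot(pool(query_tokens), pool(doc_tokens))


# Toy 2-D token embeddings (made up for illustration).
S = math.sqrt(0.5)
QUERY = [[1.0, 0.0], [0.0, 1.0]]   # two distinct query tokens
DOC_A = [[1.0, 0.0], [0.0, 1.0]]   # contains a match for each query token
DOC_B = [[S, S]]                   # one blended token, same pooled direction

if __name__ == "__main__":
    for name, doc in (("DOC_A", DOC_A), ("DOC_B", DOC_B)):
        print(name, late_interaction_score(QUERY, doc), dense_score(QUERY, doc))
```

In this toy case the pooled vectors of DOC_A and DOC_B point in the same direction, so the dense score cannot separate them, while the token-level score is 2.0 for DOC_A and about 1.41 for DOC_B. This is the kind of fine-grained distinction the finding above attributes to late interaction; it says nothing about the size of the effect reported in the paper.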

Figure 2: Trajectory comparison on a multi-site task (WebArena 760), which requires retrieving a specific customer address from the CMS to plan a route on the Map. WebNavigator achieves human-level planning via two navigate(domain, query) actions, whereas the ReAct baseline prematurely terminates du


This review was generated by AI and reviewed by human editors.