QUICK REVIEW

[Paper Digest] AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

SiQi Pei, Liang Tang|arXiv (Cornell University)|Mar 18, 2026

Multimodal Machine Learning Applications | Cited by: 0

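One-Sentence Summary

AdaZoom-GUI combines an instruction refinement module with a conditional zoom-in grounding strategy and uses GRPO to train the grounding model to predict both click points and element bounding boxes, reaching state-of-the-art performance on high-resolution GUI benchmarks among models of comparable size.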

ABSTRACT

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.

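Motivation and Objectives

Make GUI grounding robust on high-resolution screenshots and small UI elements.
Improve instruction understanding by rewriting natural language commands into explicit, detailed descriptions.
Improve localization accuracy with a conditional (adaptive) zoom strategy that triggers a second inference pass only when needed.
Train the grounding model with a high-quality GUI dataset and GRPO to predict both click coordinates and element bounding boxes.
Demonstrate strong empirical performance against state-of-the-art GUI grounding models with comparable or larger parameter counts.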

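Proposed Method

Introduce an instruction refinement module that rewrites instructions into explicit, detailed commands (for example, specifying position and visual characteristics).
Use a grounding model that takes the refined instruction and the GUI screenshot and outputs both a click point and the target element's bounding box.
Apply a conditional zoom strategy that triggers a second inference pass only when the predicted bounding box is small, preserving context in simpler cases.
Train the grounding model with GRPO, using a reward that combines click-point and bounding-box objectives.
Construct a high-quality GUI grounding dataset, using a large language model to rewrite instructions and resizing/padding images to handle different resolutions.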

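Experimental Results

Research Questions

RQ1: Does instruction refinement improve GUI grounding performance by making target descriptions more explicit?
RQ2: Does the conditional zoom strategy balance localization accuracy and computational cost across high-resolution and low-resolution GUI scenarios?
RQ3: How well does GRPO-guided training perform at predicting click coordinates and element bounding boxes in GUI grounding?
RQ4: Does combining instruction refinement with adaptive zoom yield significant gains over state-of-the-art methods for GUI grounding models of comparable size?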

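Key Findings

Combining instruction refinement with conditional zoom achieves state-of-the-art performance among comparably sized models on ScreenSpot-Pro.
Conditional zoom is more accurate than unconditional zoom on ScreenSpot-v2, indicating that an adaptive strategy is needed.
Grounding with the refinement model already raises the average score before zoom is applied, showing the benefit of stronger instruction understanding.
The full AdaZoom-GUI pipeline (refinement + grounding + conditional zoom) improves substantially over baseline grounding alone and outperforms several larger models on the benchmarks.
GRPO training enables joint optimization of click-point and bounding-box prediction, consistent with the dual-output grounding objective.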
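
As a rough illustration of how a single reward can cover both outputs, the sketch below scores a sampled response by whether its click lands inside the ground-truth box and by the IoU of its predicted box, then normalizes rewards within a group of samples, as GRPO does. The exact reward used in the paper is not given in this digest; the hit-or-miss click term, the IoU term, the equal weights, and the function names are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def reward(click: Point, box: Box, gt: Box,
           w_click: float = 0.5, w_box: float = 0.5) -> float:
    """Assumed reward: click-in-box indicator plus box IoU, equally weighted."""
    hit = 1.0 if gt[0] <= click[0] <= gt[2] and gt[1] <= click[1] <= gt[3] else 0.0
    return w_click * hit + w_box * iou(box, gt)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this scheme, a sample that clicks correctly but draws a loose box still earns partial credit, and it is ranked above a complete miss within the same group.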

This digest was generated by AI and reviewed by a human editor.