QUICK REVIEW

[Paper Review] AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

SiQi Pei, Liang Tang|arXiv (Cornell University)|Mar 18, 2026

Multimodal Machine Learning Applications | Citations: 0

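One-Sentence Summary

AdaZoom-GUI combines an instruction refinement module with a conditional zoom-in grounding strategy, trains the grounding model with GRPO to localize GUI elements and their bounding boxes, and achieves state-of-the-art performance on high-resolution GUI benchmarks among models of comparable size.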

ABSTRACT

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.

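Research Motivation and Objectives

Achieve robust GUI grounding on high-resolution screenshots that contain small UI elements.
Improve instruction understanding by rewriting natural language commands into explicit, detailed descriptions.
Improve localization accuracy with a conditional (adaptive) zoom-in strategy that activates only when needed.
Train a grounding model that predicts both click coordinates and element bounding boxes, using a high-quality GUI dataset and GRPO.
Demonstrate strong empirical performance against state-of-the-art GUI grounding models with comparable or larger parameter counts.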

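Proposed Method

Introduce an instruction refinement module that rewrites instructions into explicit, detailed descriptions (e.g., position, visual features).
Use a grounding model that outputs both a click point and the target element's bounding box from the refined instruction and the GUI screenshot.
Apply a conditional zoom-in strategy that triggers a second inference round only when the predicted box is small, preserving context in simple cases.
Train the grounding model with GRPO, using a reward that jointly optimizes click-point and bounding-box prediction.
Construct a high-quality GUI grounding dataset, augment instructions with LMM-generated variations, and resize/pad images to cover a range of resolutions.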
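The conditional zoom-in step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ground` callable, the `min_area_ratio` threshold, and the `scale` factor are all hypothetical, since the review does not give the paper's trigger criterion or crop size. Here `ground(region)` stands in for running the grounding model on one region of the screenshot and is assumed to return coordinates relative to that region's top-left corner, in source-pixel units.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Prediction:
    click: Tuple[float, float]
    box: Box


def crop_window(box: Box, width: int, height: int, scale: float) -> Box:
    """Window of size (width/scale, height/scale) centred on `box`, clamped to the image."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    w, h = width / scale, height / scale
    x1 = min(max(cx - w / 2, 0.0), width - w)
    y1 = min(max(cy - h / 2, 0.0), height - h)
    return (x1, y1, x1 + w, y1 + h)


def ground_with_conditional_zoom(
    ground: Callable[[Box], Prediction],  # hypothetical model wrapper
    width: int,
    height: int,
    min_area_ratio: float = 0.001,  # assumed threshold, not from the paper
    scale: float = 4.0,             # assumed zoom factor, not from the paper
) -> Prediction:
    # First pass on the full screenshot.
    first = ground((0.0, 0.0, float(width), float(height)))
    box_area = (first.box[2] - first.box[0]) * (first.box[3] - first.box[1])

    # Large predicted element: keep the first-pass result and the full context.
    if box_area / (width * height) >= min_area_ratio:
        return first

    # Small predicted element: run a second pass on a crop around it.
    window = crop_window(first.box, width, height, scale)
    second = ground(window)

    # Map crop-relative coordinates back to full-image coordinates.
    ox, oy = window[0], window[1]
    return Prediction(
        click=(second.click[0] + ox, second.click[1] + oy),
        box=(second.box[0] + ox, second.box[1] + oy,
             second.box[2] + ox, second.box[3] + oy),
    )
```

In a real pipeline the crop would typically be upscaled before the second pass, and the model's output rescaled accordingly; that step is omitted here to keep the coordinate mapping easy to follow.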

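Experimental Results

Research Questions

RQ1: Does instruction refinement improve GUI grounding performance by making target descriptions clearer?
RQ2: Can the conditional zoom-in strategy balance localization accuracy and computational efficiency across high- and low-resolution GUI scenarios?
RQ3: What effect does GRPO-guided training have on predicting click coordinates and element bounding boxes in GUI grounding?
RQ4: How does combining instruction refinement with adaptive zooming change performance relative to state-of-the-art models of comparable size?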

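Key Findings

Combining instruction refinement with conditional zoom-in achieves state-of-the-art performance on ScreenSpot-Pro among models of comparable size.
Conditional zoom-in achieves higher accuracy than unconditional zoom on ScreenSpot-v2, indicating the need for an adaptive strategy.
Pairing the refinement model with the grounding model improves the average score even before zoom-in is applied, showing the benefit of stronger instruction understanding.
The full AdaZoom-GUI pipeline (refinement + grounding + conditional zoom) yields substantial improvements over the base grounding model alone, with benchmark results that surpass several larger models.
Training with GRPO can jointly optimize click-point and bounding-box prediction, which fits the dual-output grounding objective.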
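To make the dual-output objective concrete, the sketch below shows one plausible way to score a sampled prediction and turn a group of such scores into GRPO-style advantages. The reward form is an assumption for illustration only: the review does not state the paper's reward, so the click-inside-box indicator, the IoU term, and the weights `w_click` and `w_box` are hypothetical. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the general GRPO formulation; using the population standard deviation is an implementation choice here.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def point_in_box(p: Tuple[float, float], box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def grounding_reward(
    click: Tuple[float, float],
    pred_box: Box,
    gt_box: Box,
    w_click: float = 0.5,  # hypothetical weight
    w_box: float = 0.5,    # hypothetical weight
) -> float:
    """Assumed reward: click-hit indicator plus box IoU, weighted."""
    return w_click * float(point_in_box(click, gt_box)) + w_box * iou(pred_box, gt_box)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the group, a sample is reinforced only when it beats the other predictions sampled for the same screenshot and instruction; if every sample earns the same reward, all advantages are zero.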


This review was generated by AI and checked by a human editor.