QUICK REVIEW

[Paper Review] UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye | ArXiv.org | 2025-01-21

Multi-Agent Systems and Negotiation | Citations: 4

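One-Line Summary

UI-TARS is an end-to-end native GUI agent that perceives GUIs through screenshots, reasons with System-1 and System-2 thinking, and learns iteratively from online traces, outperforming framework-based models.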

ABSTRACT

This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

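Research Motivation and Goals

Promote the shift from rule-based and framework-based GUI agents to native end-to-end GUI agent models.
Define the core capabilities of a native GUI agent: perception, action, reasoning, and memory.
Propose UI-TARS as a scalable implementation with enhanced perception, a unified action space, System-2 reasoning, and iterative online learning.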

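Proposed Method

Develop a pure-vision GUI agent that takes screenshots as input and responds with concrete actions.
Build enhanced perception from a large-scale GUI screenshot dataset covering element description, dense captioning, state transition captioning, question answering, and set-of-mark prompting.
Establish a unified action space that standardizes actions across platforms, and build a large-scale action trace dataset for grounding.
Introduce System-2 reasoning by injecting deliberate thinking and diverse reasoning patterns into decision making.
Implement iterative online training that collects interaction traces on hundreds of virtual machines, filters them, and refines them through reflection tuning and Direct Preference Optimization (DPO).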
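As a rough illustration of the DPO step mentioned above, the standard per-pair DPO loss can be sketched as follows. This is the generic published DPO objective, not code from the UI-TARS authors; the function name `dpo_loss`, the default `beta=0.1`, and the reading of "chosen" as a corrected action and "rejected" as the erroneous one are assumptions made for illustration.

```python
import math


def dpo_loss(policy_logp_chosen: float,
             policy_logp_rejected: float,
             ref_logp_chosen: float,
             ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. All names here are hypothetical.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)

    # Numerically stable -log(sigmoid(x)) = log(1 + exp(-x)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the chosen (e.g. corrected) action: loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss shrinks as the policy raises the likelihood of the preferred response relative to the reference model, which is the mechanism by which paired good and bad traces can steer the agent away from repeated mistakes.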

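Experimental Results

Research Questions

RQ1: Can a purely native end-to-end GUI agent outperform modular framework-based agents on perception, grounding, and task execution benchmarks?
RQ2: How do enhanced GUI perception, unified action modeling, and System-2 reasoning contribute to performance gains across desktop, web, and mobile GUIs?
RQ3: Does reflective learning from online traces improve robustness and generalization to unseen interfaces?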

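Key Results

UI-TARS achieves state-of-the-art performance on more than 10 GUI agent benchmarks covering perception, grounding, and agent execution.
UI-TARS-72B scores 82.8 on VisualWebBench, surpassing GPT-4o's 78.5.
On OSWorld, UI-TARS-72B scores 24.6 (50 steps) and 22.7 (15 steps), ahead of Claude's 22.0 and 14.9, respectively.
On AndroidWorld, UI-TARS scores 46.6, surpassing GPT-4o's 34.5.
Perception and grounding reach high accuracy across mobile, desktop, and web environments, with, for example, a reported score of 38.1 on ScreenSpot Pro.
The experiments show that the 72B variant excels at multi-step and dynamic tasks, supporting the System-2 reasoning and online refinement design.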
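To make the size of the reported gaps easier to compare, the short sketch below computes the absolute margin of UI-TARS over each baseline. The scores are copied from the results listed above; the dictionary layout and the `margins` helper are illustrative assumptions, not part of the paper.

```python
# (UI-TARS score, baseline score, baseline name), taken from the results above.
RESULTS = {
    "VisualWebBench": (82.8, 78.5, "GPT-4o"),
    "OSWorld (50 steps)": (24.6, 22.0, "Claude"),
    "OSWorld (15 steps)": (22.7, 14.9, "Claude"),
    "AndroidWorld": (46.6, 34.5, "GPT-4o"),
}


def margins(results: dict) -> dict:
    """Return the absolute margin (UI-TARS minus baseline) per benchmark."""
    return {
        name: round(ours - baseline, 1)
        for name, (ours, baseline, _) in results.items()
    }


if __name__ == "__main__":
    for name, gap in margins(RESULTS).items():
        baseline_name = RESULTS[name][2]
        print(f"{name}: +{gap} over {baseline_name}")
```

On these numbers the smallest margin is on OSWorld with 50 steps and the largest is on AndroidWorld; the OSWorld gap widens considerably when the step budget is cut to 15.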

This review was generated by AI and reviewed by a human editor.