Published: 2025/10/23 7:18:32

Title: Smarter GUI Operation! Powering Up AI Agents ✨

Super-short summary: Interpret instructions smartly! Research that rapidly advances GUI-operating AI 🚀


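💎 Gal-Style Sparkle Points ✨
● The new idea here is "Instruction-as-Reasoning," which interprets instructions from multiple angles!
● GUI operation gets super smooth, so it looks set to shine across all kinds of services 💖
● From AI assistants onward, the future business opportunities are limitless 😍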



UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Liangyu Chen / Hanzhang Zhou / Chenglin Cai / Jianan Zhang / Panrong Tong / Quyu Kong / Xu Zhang / Chen Liu / Yuqi Liu / Wenxuan Wang / Yue Wang / Qin Jin / Steven Hoi

GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields a relative performance improvement of up to 76%. In this paper, we introduce the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives and enables the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld with UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.
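To make the quantities in the abstract concrete, the sketch below shows one common way grounding accuracy and relative improvement can be computed. It is an illustration under stated assumptions, not the authors' evaluation code: it assumes a prediction counts as correct when the predicted click point falls inside the target element's bounding box, and the names `hit`, `grounding_accuracy`, and `relative_improvement`, along with the toy coordinates, are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]               # predicted click (x, y) in pixels
Box = Tuple[float, float, float, float]   # target element (left, top, right, bottom)


def hit(point: Point, box: Box) -> bool:
    """Assumed correctness rule: the click lands inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(points: List[Point], boxes: List[Box]) -> float:
    """Fraction of predictions that hit their paired target element."""
    if not points:
        return 0.0
    return sum(hit(p, b) for p, b in zip(points, boxes)) / len(points)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain over a baseline accuracy: (improved - baseline) / baseline."""
    return (improved - baseline) / baseline


# Toy data (hypothetical): four target elements and two sets of predictions,
# e.g. one set from the original instructions and one from rephrased ones.
boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50), (60, 60, 70, 70)]
baseline_points = [(5, 5), (0, 0), (45, 45), (0, 0)]      # 2 of 4 hit
improved_points = [(5, 5), (25, 25), (45, 45), (0, 0)]    # 3 of 4 hit

baseline_acc = grounding_accuracy(baseline_points, boxes)
improved_acc = grounding_accuracy(improved_points, boxes)
gain = relative_improvement(baseline_acc, improved_acc)
```

In this toy case accuracy rises from 0.5 to 0.75, a 50% relative improvement. The 76% figure reported in the abstract is a relative gain of this kind, not an absolute accuracy difference.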

cs / cs.CV / cs.AI