LN: Gao et al. (2023) — AssistGUI: Task-Oriented Desktop GUI Automation

Bibliographic Reference

Citation: Gao, D., Ji, L., Bai, Z., Chen, M., Li, Z., Zeng, K., & Ma, L. (2023). AssistGUI: Task-oriented desktop graphical user interface automation. arXiv:2312.03723. https://arxiv.org/abs/2312.03723


Pass 1 — Bird’s Eye View (5 Cs)

C             | Assessment
Category      | System proposal + empirical evaluation
Context       | GUI automation using LLMs with visual understanding; addresses limitations of purely text-based agents for desktop applications
Correctness   | Evaluated on a benchmark of desktop tasks; compared against GPT-4V and other baselines
Contributions | (1) Task-oriented GUI automation that understands natural language instructions and executes desktop actions; (2) A benchmark of office productivity tasks; (3) Memory and planning mechanisms for multi-step workflows
Clarity       | Good. Task taxonomy and pipeline are clear.

Relevance: ⭐⭐⭐

Relevant as a reference for PUMA Stage 5 SmartPMO — automating PM tool interactions (Jira, GitHub) via GUI agents when APIs are unavailable.


Pass 2 — Content

Architecture

AssistGUI operates through a pipeline:

  1. Task Understanding: Parse natural language instruction into goal state
  2. Screen Perception: Detect UI elements (buttons, inputs, menus) via visual understanding
  3. Action Planning: Generate a plan of atomic UI actions (click, type, scroll)
  4. Execution: Execute actions in the OS environment
  5. Verification: Check if the goal state has been reached; retry on failure
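
The five stages above can be read as a perceive, plan, act, verify loop with a bounded number of retries. The sketch below illustrates only that control flow and is not the authors' implementation: `run_task`, `TaskResult`, the four callback parameters, and `max_attempts` are hypothetical names introduced for this note, and the toy environment stands in for a real desktop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskResult:
    done: bool                      # True if verification succeeded
    attempts: int                   # number of plan/execute rounds used
    log: List[str] = field(default_factory=list)  # every action executed


def run_task(
    instruction: str,
    perceive: Callable[[], List[str]],
    plan: Callable[[str, List[str]], List[str]],
    execute: Callable[[str], None],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> TaskResult:
    # 1. Task understanding (placeholder: the goal is just the cleaned instruction)
    goal = instruction.strip()
    log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        elements = perceive()            # 2. Screen perception
        actions = plan(goal, elements)   # 3. Action planning
        for action in actions:           # 4. Execution
            execute(action)
            log.append(action)
        if verify(goal):                 # 5. Verification
            return TaskResult(True, attempt, log)
        # Verification failed: loop again, re-perceiving the screen first.
    return TaskResult(False, max_attempts, log)


# Toy environment (assumption): the first click "misses", the second one lands.
screen = {"saved": False, "clicks": 0}


def fake_perceive() -> List[str]:
    return ["Save button"]


def fake_plan(goal: str, elements: List[str]) -> List[str]:
    return [f"click({elements[0]})"]


def fake_execute(action: str) -> None:
    screen["clicks"] += 1
    screen["saved"] = screen["clicks"] >= 2


def fake_verify(goal: str) -> bool:
    return screen["saved"]


result = run_task("  Save the document ", fake_perceive, fake_plan,
                  fake_execute, fake_verify)
print(result.done, result.attempts)
```

Re-perceiving the screen at the start of every attempt means that a retry plans against the current UI state, not a stale one. This is one plausible way to handle the state-tracking problem the note lists among the findings.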

Action Space

Action            | Description
click(element)    | Click on detected UI element
type(text)        | Type text into active field
scroll(direction) | Scroll in given direction
key(shortcut)     | Press keyboard shortcut
drag(src, dst)    | Drag from source to destination
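
One way to make this action space concrete is a small set of typed records, so that a planner can only emit the five listed actions. The class names (`Click`, `TypeText`, `Scroll`, `Key`, `Drag`) and the `describe` helper below are assumptions made for illustration; the paper's actual interface may differ.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Click:
    element: str      # label of a detected UI element


@dataclass(frozen=True)
class TypeText:
    text: str         # text typed into the active field


@dataclass(frozen=True)
class Scroll:
    direction: str    # e.g. "up" or "down" (assumed vocabulary)


@dataclass(frozen=True)
class Key:
    shortcut: str     # e.g. "ctrl+s"


@dataclass(frozen=True)
class Drag:
    src: str
    dst: str


Action = Union[Click, TypeText, Scroll, Key, Drag]


def describe(action: Action) -> str:
    """Render an action in the notation used in the table above."""
    if isinstance(action, Click):
        return f"click({action.element})"
    if isinstance(action, TypeText):
        return f"type({action.text})"
    if isinstance(action, Scroll):
        return f"scroll({action.direction})"
    if isinstance(action, Key):
        return f"key({action.shortcut})"
    if isinstance(action, Drag):
        return f"drag({action.src}, {action.dst})"
    raise TypeError(f"unsupported action: {action!r}")


# A hypothetical three-step plan, rendered as a trace.
steps: List[Action] = [Click("File menu"), Key("ctrl+s"), TypeText("report.docx")]
trace = [describe(step) for step in steps]
print(trace)
```

A real executor would replace `describe` with calls into an input-automation backend; rejecting anything outside the five types keeps the plan auditable.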

Key Findings

  • LLMs with visual understanding (GPT-4V) significantly outperform text-only approaches for GUI tasks
  • Multi-step tasks requiring state tracking are the primary failure mode
  • Office productivity tasks (Word, Excel, email) benefit from explicit planning

PUMA Integration

  • SmartPMO automation: A GUI agent could automate Jira ticket processing when the Jira REST API is unavailable or insufficient — reading, classifying, and updating tickets directly via the UI
  • Accessibility: For legacy PM tools that expose no API, GUI automation may be the only practical integration path
  • Stage 5 extension: Smart-PMO-Vision — GUI automation as a SmartPMO capability tier
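
The integration idea above amounts to an API-first path with a GUI fallback. The sketch below shows that pattern in isolation; `ApiUnavailable`, `update_ticket`, and both backends are hypothetical stand-ins, and no real Jira or PUMA interface is implied.

```python
from typing import Callable, List, Tuple


class ApiUnavailable(Exception):
    """Raised by the (hypothetical) API backend when the REST path cannot be used."""


Backend = Callable[[str, str], None]


def update_ticket(ticket_id: str, status: str,
                  via_api: Backend, via_gui: Backend) -> str:
    """Try the API first; fall back to GUI automation only if it is unavailable."""
    try:
        via_api(ticket_id, status)
        return "api"
    except ApiUnavailable:
        via_gui(ticket_id, status)
        return "gui"


# Stand-in backends (assumptions): the API always fails, the GUI agent records calls.
gui_calls: List[Tuple[str, str]] = []


def broken_api(ticket_id: str, status: str) -> None:
    raise ApiUnavailable("REST endpoint not reachable")


def recording_gui(ticket_id: str, status: str) -> None:
    gui_calls.append((ticket_id, status))


path = update_ticket("TICKET-1", "Done", broken_api, recording_gui)
print(path, gui_calls)
```

Catching only `ApiUnavailable` keeps other API errors (bad ticket ID, permission denied) visible, so they are not silently retried through the slower and more fragile GUI path.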

MOCs