LN: Gao et al. (2023) — AssistGUI: Task-Oriented Desktop GUI Automation
Bibliographic Reference
Citation: Gao, D., Ji, L., Bai, Z., Chen, M., Li, Z., Zeng, K., & Ma, L. (2023). AssistGUI: Task-oriented desktop graphical user interface automation. arXiv:2312.03723. https://arxiv.org/abs/2312.03723
Pass 1 — Bird’s Eye View (5 Cs)
| C | Assessment |
|---|---|
| Category | System proposal + empirical evaluation |
| Context | GUI automation using LLMs with visual understanding; addresses limitations of purely text-based agents for desktop applications |
| Correctness | Evaluated on a benchmark of desktop tasks; compared against GPT-4V and other baselines |
| Contributions | (1) Task-oriented GUI automation that understands natural language instructions and executes desktop actions; (2) A benchmark of office productivity tasks; (3) Memory and planning mechanisms for multi-step workflows |
| Clarity | Good. Task taxonomy and pipeline are clear. |
Relevance: ⭐⭐⭐
Relevant as a reference for PUMA Stage 5 SmartPMO — automating PM tool interactions (Jira, GitHub) via GUI agents when APIs are unavailable.
Pass 2 — Content
Architecture
AssistGUI operates through a five-stage pipeline:
- Task Understanding: Parse the natural-language instruction into a goal state
- Screen Perception: Detect UI elements (buttons, inputs, menus) through visual understanding of the screen
- Action Planning: Generate a plan of atomic UI actions (click, type, scroll)
- Execution: Execute the planned actions in the OS environment
- Verification: Check whether the goal state has been reached; retry on failure
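The pipeline stages can be sketched as a perceive-plan-act-verify loop. This is a minimal illustration written for this note, not the paper's implementation: `FakeDesktop`, `parse_instruction`, `simple_planner`, and `run_task` are hypothetical names, the "screen" is a plain dictionary of field values instead of a screenshot, and the toy instruction format `field=value; field=value` stands in for real natural-language parsing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Goal = Dict[str, str]  # hypothetical goal state: field name -> desired text


@dataclass
class Action:
    name: str      # "click" or "type" in this toy example
    argument: str  # element name for click, text for type


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for the OS environment (assumption, not from the paper)."""
    fields: Dict[str, str] = field(default_factory=dict)
    focused: str = ""

    def observe(self) -> Dict[str, str]:
        # Screen Perception stand-in: return the visible fields and their contents.
        return dict(self.fields)

    def execute(self, action: Action) -> None:
        # Execution: apply one atomic UI action to the fake environment.
        if action.name == "click":
            self.focused = action.argument
        elif action.name == "type":
            self.fields[self.focused] = action.argument
        else:
            raise ValueError(f"unsupported action: {action.name}")


def parse_instruction(instruction: str) -> Goal:
    # Task Understanding stand-in: "title=Q3 report; owner=Ana" -> goal state.
    goal: Goal = {}
    for part in instruction.split(";"):
        key, value = part.split("=", 1)
        goal[key.strip()] = value.strip()
    return goal


def goal_reached(goal: Goal, screen: Dict[str, str]) -> bool:
    # Verification: every goal field must show the desired value.
    return all(screen.get(name) == wanted for name, wanted in goal.items())


def simple_planner(goal: Goal, screen: Dict[str, str]) -> List[Action]:
    # Action Planning: click each field that is wrong, then type the desired text.
    plan: List[Action] = []
    for name, wanted in goal.items():
        if screen.get(name) != wanted:
            plan.append(Action("click", name))
            plan.append(Action("type", wanted))
    return plan


def run_task(
    instruction: str,
    env: FakeDesktop,
    planner: Callable[[Goal, Dict[str, str]], List[Action]] = simple_planner,
    max_attempts: int = 3,
) -> bool:
    goal = parse_instruction(instruction)
    for _ in range(max_attempts):
        screen = env.observe()
        for action in planner(goal, screen):
            env.execute(action)
        if goal_reached(goal, env.observe()):
            return True
    return False  # goal not reached within the retry budget
```

Calling `run_task("title=Q3 report; owner=Ana", FakeDesktop(fields={"title": "", "owner": ""}))` returns `True` and leaves both fields filled in. The retry budget `max_attempts` is an assumption of this sketch; the note does not state how many retries the system allows.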
Action Space
| Action | Description |
|---|---|
| click(element) | Click on detected UI element |
| type(text) | Type text into active field |
| scroll(direction) | Scroll in given direction |
| key(shortcut) | Press keyboard shortcut |
| drag(src, dst) | Drag from source to destination |
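To make the action space concrete, the sketch below parses action strings such as `click(submit_button)` into a structured form and rejects anything outside the five actions in the table. The string syntax, the `ParsedAction` type, and `parse_action` are assumptions made for illustration; the note does not record how AssistGUI actually serializes its actions.

```python
import re
from typing import NamedTuple, Tuple

# Number of arguments each action in the table takes.
ARITY = {"click": 1, "type": 1, "scroll": 1, "key": 1, "drag": 2}


class ParsedAction(NamedTuple):
    name: str
    args: Tuple[str, ...]


# Matches "name(arguments)" with optional surrounding whitespace.
_PATTERN = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_action(text: str) -> ParsedAction:
    """Parse a hypothetical 'name(args)' string into a validated action."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    name, raw_args = match.group(1), match.group(2)
    if name not in ARITY:
        raise ValueError(f"unknown action: {name!r}")
    if ARITY[name] == 1:
        # Keep the whole argument intact so typed text may contain commas.
        args: Tuple[str, ...] = (raw_args.strip(),)
    else:
        args = tuple(part.strip() for part in raw_args.split(","))
    if len(args) != ARITY[name] or any(not arg for arg in args):
        raise ValueError(f"{name} expects {ARITY[name]} non-empty argument(s)")
    return ParsedAction(name, args)
```

Validating the planner's output against a closed action set like this is one way to catch malformed or hallucinated actions before they reach the executor.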
Key Findings
- LLMs with visual understanding (GPT-4V) significantly outperform text-only approaches for GUI tasks
- Multi-step tasks requiring state tracking are the primary failure mode
- Office productivity tasks (Word, Excel, email) benefit from explicit planning
PUMA Integration
- SmartPMO automation: A GUI agent could automate Jira ticket processing when the Jira REST API is unavailable or insufficient — reading, classifying, and updating tickets directly via the UI
- Accessibility: For legacy PM tools that expose no API, GUI automation may be the only practical integration path
- Stage 5 extension: Smart-PMO-Vision — GUI automation as a SmartPMO capability tier
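One way to read the SmartPMO idea is as an API-first strategy with GUI automation as the fallback tier. The sketch below shows only that routing logic. Everything in it is hypothetical: `update_ticket`, `ApiUnavailable`, and the two callables are placeholders for this note, and no real Jira client or GUI agent is invoked.

```python
from typing import Callable, Optional


class ApiUnavailable(Exception):
    """Raised by the hypothetical API path when the REST API cannot serve the request."""


def update_ticket(
    ticket_id: str,
    status: str,
    via_api: Optional[Callable[[str, str], None]],
    via_gui: Callable[[str, str], None],
) -> str:
    """Prefer the API path; fall back to the GUI agent. Returns the path that ran."""
    if via_api is not None:
        try:
            via_api(ticket_id, status)
            return "api"
        except ApiUnavailable:
            pass  # fall through to GUI automation
    via_gui(ticket_id, status)
    return "gui"
```

Passing `via_api=None` models a legacy tool with no API at all, while raising `ApiUnavailable` models an API that exists but cannot handle a given request.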
Related Notes
- PN-MultiAgent-ArchitecturePatterns — GUI agent as a specialized tool-use agent
- LN-Park-2023-GenerativeAgents — agent with memory and planning
- LN-Xie-2023-OpenAgents — complementary web-based agent