Multi-agent systems outperform single agents on PM tasks when agent roles match task specialisation boundaries

A single general-purpose LLM accumulates error when acting simultaneously as domain expert, planner, executor, and quality checker. Decomposing these roles into specialised agents with narrow objectives reduces error propagation.


The Core Claim

Multiple specialist agents — each with a narrow objective, dedicated tools, and domain-specific context — outperform a single agent trying to handle everything. This is not universally true (overhead exists), but it holds when:

  1. Tasks have distinct information requirements
  2. Tasks have distinct decision criteria
  3. Tasks have distinct failure modes

Evidence

MASAI (Arora et al., 2024): Modular sub-agents achieve 28.33% on SWE-bench Lite vs. single-agent baselines that achieve <15%. Each sub-agent (repository curator, fault localiser, patch generator) has one objective.

MetaGPT (Hong et al., 2023): Role-based agents (PM → Architect → Engineer → QA) with artifact-driven handoffs produce higher-quality software than unstructured agent conversations.

Orchestrating Human-AI Teams (Dorri et al., 2025): GPT-5 as Manager Agent outperforms GPT-4.1 by decomposing tasks proactively. Reactive communication loops underperform.


Application to PUMA

PUMA Stage 1–3 tests the single-agent hypothesis: can one LLM (prompted correctly) match a specialist? This establishes the baseline.

PUMA Stage 5 (Smart PMO) tests the multi-agent hypothesis: does decomposing into (Triage Agent + Estimation Agent + Planning Agent + Risk Agent) + Manager Agent outperform the Stage 1–3 single agents?

Architecture rules for PUMA multi-agent design:

  1. Each agent has exactly one PM task type
  2. Agents share no context directly — only through the Manager Agent
  3. Manager Agent decomposes backlog → assigns tasks → aggregates results
  4. No agent modifies another agent’s output without Manager oversight (HITL principle)

References

MOCs