MOC: LLM Benchmarks, PM-AI Convergence & Agent Architectures
Overview
Navigation map for all literature on LLM agents, PM-AI convergence, benchmarks, and agent architectures. Updated with verified references from bibliography supplement v3.
🎯 PUMA Core Papers (highest relevance)
| Paper | arXiv | Key Contribution | PUMA Stage |
|---|---|---|---|
| Cinkusz et al. 2025 (Cognitive Agents PM) | 2508.16678 | 5-task PM benchmark with LLMs; 3 limitations PUMA addresses | 1–3 |
| Assalaarachchi et al. 2026 (Agentic SPM) | 2601.16392 | "Agentic PM" vision; SPM 3.0 framework | 5 |
| Yao et al. 2022 (ReAct) | 2210.03629 | Base agent pattern: Thought-Action-Observation | 4–5 |
| Arora et al. 2024 (MASAI) | 2406.11638 | Modular sub-agent architecture for SE | 4–5 |
| Gao et al. 2024 (AgentScope) | 2402.14034 | Multi-agent platform; native Ollama support | 5 |
| Dorri et al. 2025 (Orchestrating HA Teams) | 2510.02557 | Manager Agent framework; GPT-5 vs GPT-4.1 | 5 |
🏗️ Foundational LLM Architecture
- LN-Vaswani-2017-AttentionIsAllYouNeed - Transformer architecture: attention mechanism, the backbone of all modern LLMs (NeurIPS 2017)
- LN-Fedus-2022-SwitchTransformers - MoE: sparse routing, expert capacity, DeepSeek-V3/Mixtral lineage (JMLR 2022)
- LN-Wei-2022-ChainOfThought - CoT prompting: reasoning chains emerge at scale >100B parameters (NeurIPS 2022)
🤖 Agent Architectures
Foundation Papers
- LN-Yao-2022-ReAct - ReAct: Thought-Action-Observation loop
- LN-Yao-2023-TreeOfThoughts - ToT: Tree search over reasoning
- LN-Zelikman-2024-QuietSTaR - Quiet-STaR: why CoT works mechanistically
Multi-Agent Frameworks
- LN-Hong-2023-MetaGPT - MetaGPT: role-based multi-agent
- LN-Qian-2023-ChatDev - ChatDev: chat-based agent teams
- LN-Wu-2023-AutoGen - AutoGen: conversational multi-agent
- LN-Gao-2024-AgentScope - AgentScope: robust platform (Ollama native)
- LN-Talebirad-2023-MultiAgentSurvey - Survey: MAS communication patterns
Software Engineering Agents
- LN-Arora-2024-MASAI - MASAI: modular sub-agents (SOTA SWE-bench)
- LN-Jimenez-2023-SWEbench - SWE-bench: real GitHub issue benchmark
- LN-Wang-2024-OpenHands - OpenHands: open agentic platform
- LN-MAGIS-2024-GitHubIssues - MAGIS: multi-agent GitHub issue resolution
Architecture Surveys & Taxonomies
- LN-Masterman-2024-AgentArchSurvey - Landscape of AI agent architectures
- LN-Ning-2025-AgentTaxonomy - Taxonomy + decision model for agent design
- LN-Tang-2025-LLMOrbit - LLMOrbit: scaling to agentic AI taxonomy
Self-Improvement & Reasoning
- LN-Shinn-2023-Reflexion - Reflexion: verbal self-reflection loop (NeurIPS 2023)
- LN-Liu-2023-AgentBench - AgentBench: 8-environment benchmark, open-source vs GPT-4 gap (ICLR 2024)
- LN-Park-2023-GenerativeAgents - Generative Agents: memory stream + reflection + planning (UIST 2023)
- LN-Xie-2023-OpenAgents - OpenAgents: Data/Plugins/Web agent triad
Memory & State
- LN-Packer-2023-MemGPT - MemGPT: virtual context management
- LN-AssistGUI-2023 - AssistGUI: GUI automation for PM tool integration
Collaboration & Coordination
- LN-Huang-2024-InternetOfAgents - Internet of Agents: heterogeneous coordination
- LN-HiveMind-2025-SwarmOptimization - HiveMind: swarm-level optimisation
- LN-OrchestratingHumanAI-2025 - Manager Agent as unifying challenge
Workflow & Orchestration
- LN-Yu-2025-DynTaskMAS - DynTaskMAS: async parallel task graphs
- LN-Flow-2025-AgenticWorkflow - Flow: modular workflow composition
- LN-IntelligentSparkAgents-2024 - LangGraph in practice
Security & Governance
- LN-AuthenticatedWorkflows-2026 - Authenticated Workflows: protecting agentic AI
- LN-Hou-2025-MCP-Security - MCP: landscape and security threats
- LN-HAIF-2026-HumanAIIntegration - HAIF: human-AI integration governance
PM-AI Convergence
Agentic PM Vision
- LN-Assalaarachchi-2026-AgenticSPM ⭐ - SPM 3.0: Agentic PM vision
- LN-Cinkusz-2025-CognitiveAgentsAgilePM ⭐ - PUMA predecessor paper
- LN-Shao-2025-FutureOfWork - 40% PM tasks automatable
AIOps & DevOps
- LN-Chen-2025-AIOpsLab - AIOpsLab: AIOps benchmark (PUMA analogue)
- LN-Bruneliere-2022-AIDOaRt - AIDOaRt: AI-augmented DevOps framework
- LN-Chen-2024-RootCauseAnalysis - Root cause analysis via LLMs (Microsoft production)
- LN-Incident-Management-AI-2023 - Incident Management survey: triage lifecycle (arXiv 2312.14411)
PM-AI & Human Collaboration
- LN-LLM-MAS-SE-2024 - LLM-MAS for SE: PM agent taxonomy, evaluation gap (ACM TOSEM)
- LN-Flyvbjerg-2023-UniquenessTrap - Uniqueness Trap: PUMA's theoretical motivation (RCF)
- LN-Collaborating-AIAgents-2025 - Field experiments: human-AI team productivity (arXiv 2503.18238)
- LN-Hubinger-2019-LearnedOptimization - Risks from Learned Optimization: inner alignment, HITL basis
Benchmarks & Evaluation
- LN-Jimenez-2023-SWEbench - SWE-bench (code fixing baseline)
- LN-Mialon-2023-GAIA - GAIA (general AI assistant benchmark)
- LN-Chen-2025-AIOpsLab - AIOpsLab (AIOps benchmark)
AI Code Quality & Code Review
- LN-CodeRabbit-2025-AIvsHumanCode ⭐ - CodeRabbit (2025): AI vs Human PRs: 470 PRs, 1.7× more issues, 2.74× security CVEs, 3× readability deficit, 7 mitigation strategies
Related Permanent Notes
- PN-KeyConcepts-Agents-Reproducibility-RedTeam - Agent OS, Reproducibility
- PN-RAG-Embeddings-VectorDB - Stage 4 RAG
- PN-LLM-Local-vs-Cloud - Local vs cloud tradeoff
- PN-MultiAgent-ArchitecturePatterns - Specialisation evidence
- PN-ReAct-AgentPattern - ReAct foundation
- PN-IssueTriage-StoryPoints - PUMA target tasks
- Smart-PMO-Vision - Stage 5 future
- PN-Evaluation-Metrics-Comprehensive - F1, MAE, SA, SPR, Wilcoxon, CI
- PN-StatisticalValidation-Full - Wilcoxon, bootstrap, effect size
- PN-LLM-Models-PUMA - Model comparison: Llama, Mistral, GPT-4o, DeepSeek
- PN-Reflexion-SelfCritique - Verbal reinforcement loop
- PN-GenerativeAgents-Simulacra - Memory stream architecture
- PN-HITL-BoundedAutonomy - Human oversight design
- PN-UniquenessTrap - Reference class forecasting in PUMA
- PN-COSTAR-SelfConsistency - Prompt engineering frameworks
- PN-AlgorithmicBias - Fairness and bias in PM AI
- PN-ComputationalSustainability - Carbon footprint tracking
- PN-FineTuning-LoRA-Quantization - LoRA, QLoRA, GGUF