📊 MOC — LLM Benchmarks, PM-AI Convergence & Agent Architectures

Overview

Navigation map for all literature on LLM agents, PM-AI convergence, benchmarks, and agent architectures. Updated with verified references from bibliography supplement v3.

🎯 PUMA Core Papers (highest relevance)

Paper	arXiv	Key Contribution	PUMA Stage
Cinkusz et al. 2025 (Cognitive Agents PM)	2508.16678	5-task PM benchmark with LLMs; 3 limitations PUMA addresses	1–3
Assalaarachchi et al. 2026 (Agentic SPM)	2601.16392	"Agentic PM" vision; SPM 3.0 framework	5
Yao et al. 2022 (ReAct)	2210.03629	Base agent pattern: Thought-Action-Observation	4–5
Arora et al. 2024 (MASAI)	2406.11638	Modular sub-agent architecture for SE	4–5
Gao et al. 2024 (AgentScope)	2402.14034	Multi-agent platform; native Ollama support	5
Dorri et al. 2025 (Orchestrating HA Teams)	2510.02557	Manager Agent framework; GPT-5 vs GPT-4.1	5

🏗️ Foundational LLM Architecture

LN-Vaswani-2017-AttentionIsAllYouNeed — Transformer architecture: self-attention mechanism, the backbone of nearly all modern LLMs (NeurIPS 2017)
LN-Fedus-2022-SwitchTransformers — MoE: sparse routing, expert capacity, DeepSeek-V3/Mixtral lineage (JMLR 2022)
LN-Wei-2022-ChainOfThought — CoT prompting: reasoning chains emerge at scale >100B parameters (NeurIPS 2022)

🤖 Agent Architectures

Foundation Papers

LN-Yao-2022-ReAct — ReAct: Thought-Action-Observation loop
LN-Yao-2023-TreeOfThoughts — ToT: Tree search over reasoning
LN-Zelikman-2024-QuietSTaR — Quiet-STaR: why CoT works mechanistically

Multi-Agent Frameworks

LN-Hong-2023-MetaGPT — MetaGPT: role-based multi-agent
LN-Qian-2023-ChatDev — ChatDev: chat-based agent teams
LN-Wu-2023-AutoGen — AutoGen: conversational multi-agent
LN-Gao-2024-AgentScope — AgentScope: robust platform (Ollama native)
LN-Talebirad-2023-MultiAgentSurvey — Survey: MAS communication patterns

Software Engineering Agents

LN-Arora-2024-MASAI — MASAI: modular sub-agents (SOTA on SWE-bench Lite at publication)
LN-Jimenez-2023-SWEbench — SWE-bench: real GitHub issue benchmark
LN-Wang-2024-OpenHands — OpenHands: open agentic platform
LN-MAGIS-2024-GitHubIssues — MAGIS: multi-agent GitHub issue resolution

Architecture Surveys & Taxonomies

LN-Masterman-2024-AgentArchSurvey — Landscape of AI agent architectures
LN-Ning-2025-AgentTaxonomy — Taxonomy + decision model for agent design
LN-Tang-2025-LLMOrbit — LLMOrbit: scaling to agentic AI taxonomy

Self-Improvement & Reasoning

LN-Shinn-2023-Reflexion — Reflexion: verbal self-reflection loop (NeurIPS 2023)
LN-Liu-2023-AgentBench — AgentBench: 8-environment benchmark, open-source vs GPT-4 gap (ICLR 2024)
LN-Park-2023-GenerativeAgents — Generative Agents: memory stream + reflection + planning (UIST 2023)
LN-Xie-2023-OpenAgents — OpenAgents: Data/Plugins/Web agent triad

Memory & State

LN-Packer-2023-MemGPT — MemGPT: virtual context management
LN-AssistGUI-2023 — AssistGUI: GUI automation for PM tool integration

Collaboration & Coordination

LN-Huang-2024-InternetOfAgents — Internet of Agents: heterogeneous coordination
LN-HiveMind-2025-SwarmOptimization — HiveMind: swarm-level optimisation
LN-OrchestratingHumanAI-2025 — Manager Agent as unifying challenge

Workflow & Orchestration

LN-Yu-2025-DynTaskMAS — DynTaskMAS: async parallel task graphs
LN-Flow-2025-AgenticWorkflow — Flow: modular workflow composition
LN-IntelligentSparkAgents-2024 — LangGraph in practice

Security & Governance

LN-AuthenticatedWorkflows-2026 — Authenticated Workflows: protecting agentic AI
LN-Hou-2025-MCP-Security — MCP: landscape and security threats
LN-HAIF-2026-HumanAIIntegration — HAIF: human-AI integration governance

📋 PM-AI Convergence

Agentic PM Vision

LN-Assalaarachchi-2026-AgenticSPM — ⭐ SPM 3.0: Agentic PM vision
LN-Cinkusz-2025-CognitiveAgentsAgilePM — ⭐ PUMA predecessor paper
LN-Shao-2025-FutureOfWork — 40% PM tasks automatable

AIOps & DevOps

LN-Chen-2025-AIOpsLab — AIOpsLab: AIOps benchmark (PUMA analogue)
LN-Bruneliere-2022-AIDOaRt — AIDOaRt: AI-augmented DevOps framework
LN-Chen-2024-RootCauseAnalysis — Root cause analysis via LLMs (Microsoft production)
LN-Incident-Management-AI-2023 — Incident Management survey: triage lifecycle (arXiv 2312.14411)

PM-AI & Human Collaboration

LN-LLM-MAS-SE-2024 — LLM-MAS for SE: PM agent taxonomy, evaluation gap (ACM TOSEM)
LN-Flyvbjerg-2023-UniquenessTrap — Uniqueness Trap: PUMA’s theoretical motivation (RCF)
LN-Collaborating-AIAgents-2025 — Field experiments: human-AI team productivity (arXiv 2503.18238)
LN-Hubinger-2019-LearnedOptimization — Risks from Learned Optimization: inner alignment, HITL basis

📊 Benchmarks & Evaluation

LN-Jimenez-2023-SWEbench — SWE-bench (code fixing baseline)
LN-Mialon-2023-GAIA — GAIA (general AI assistant benchmark)
LN-Chen-2025-AIOpsLab — AIOpsLab (AIOps benchmark)

🔍 AI Code Quality & Code Review

LN-CodeRabbit-2025-AIvsHumanCode — ⭐ CodeRabbit (2025): AI vs Human PRs — 470 PRs, 1.7× more issues, 2.74× security CVEs, 3× readability deficit, 7 mitigation strategies

🔗 Related Permanent Notes

PN-KeyConcepts-Agents-Reproducibility-RedTeam — Agent OS, Reproducibility
PN-RAG-Embeddings-VectorDB — Stage 4 RAG
PN-LLM-Local-vs-Cloud — Local vs cloud tradeoff
PN-MultiAgent-ArchitecturePatterns — Specialisation evidence
PN-ReAct-AgentPattern — ReAct foundation
PN-IssueTriage-StoryPoints — PUMA target tasks
Smart-PMO-Vision — Stage 5 future
PN-Evaluation-Metrics-Comprehensive — F1, MAE, SA, SPR, Wilcoxon, CI
PN-StatisticalValidation-Full — Wilcoxon, bootstrap, effect size
PN-LLM-Models-PUMA — Model comparison: Llama, Mistral, GPT-4o, DeepSeek
PN-Reflexion-SelfCritique — Verbal reinforcement loop
PN-GenerativeAgents-Simulacra — Memory stream architecture
PN-HITL-BoundedAutonomy — Human oversight design
PN-UniquenessTrap — Reference class forecasting in PUMA
PN-COSTAR-SelfConsistency — Prompt engineering frameworks
PN-AlgorithmicBias — Fairness and bias in PM AI
PN-ComputationalSustainability — Carbon footprint tracking
PN-FineTuning-LoRA-Quantization — LoRA, QLoRA, GGUF
