Knowledge Coverage Report — PUMA Vault Expansion (April 2026)

Purpose

This report documents the systematic knowledge expansion of the PUMA vault conducted in April 2026, together with the follow-up update of May 2026. It identifies what was missing, what was added, which gaps remain, and which research directions are proposed.


Executive Summary

Metric | Before Expansion | After Expansion (Apr) | After Expansion (May) | Delta Total
Literature notes (LN-) | ~95 | 105 | 105 | +10
Permanent notes (PN-) | ~15 | 26 | 31 | +16
Book notes | 0 | 4 | 4 | +4
MOC entries updated | - | 2 | 3 | -

May 2026 Update

Five high-priority permanent notes were created in May 2026: PN-ContextEngineering, PN-MCP-ModelContextProtocol, PN-TreeOfThoughts-Deliberate, PN-RLHF-Constitutional, and PN-Transformer-MoE. MOC-Methods-Frameworks was updated with all five entries.


What Was Added

New Literature Notes

File | Topic | Key Contribution to PUMA
LN-Shinn-2023-Reflexion | Agent self-improvement | Reflexion loop → Stage 4 iterative triage
LN-Liu-2023-AgentBench | Agent benchmarking | SPR metric; open-source model capability gap
LN-Park-2023-GenerativeAgents | Memory-augmented agents | SmartPMO persistent memory architecture
LN-Xie-2023-OpenAgents | Agent triads | Data Agent → TAWOS processing pipeline
LN-Hubinger-2019-LearnedOptimization | AI safety | Inner alignment → HITL theoretical basis
LN-Incident-Management-AI-2023 | AIOps survey | MTTD/MTTR metrics; triage lifecycle positioning
LN-AssistGUI-2023 | GUI automation | Stage 5 Jira integration without REST API
LN-Collaborating-AIAgents-2025 | Field experiments | Ecological validity for augmentation debate
LN-LLM-MAS-SE-2024 | MAS for SE | PM agent taxonomy; evaluation gap PUMA fills
LN-Flyvbjerg-2023-UniquenessTrap | PM theory | Uniqueness Trap → PUMA’s motivating problem

New Book Notes

File | Author | Key Contribution to PUMA
LN-Beck-1999-XPExplained | Kent Beck | Story points origin; planning game history
LN-Lawrence-2024-AtomicHuman | Neil D. Lawrence | Atomic human concept; HITL theoretical framing
LN-Suleiman-2023-ComingWave | Mustafa Suleyman | Macro AI policy context for ethics chapter
LN-Shum-2025-PensarConPrompts | Jan Shum | CO-STAR framework; prompt engineering taxonomy

New Permanent Notes

File | Topic | Scope
PN-Evaluation-Metrics-Comprehensive | All PUMA metrics | F1, MAE, SA, SPR, Wilcoxon, CI, BLEU, ROUGE, CO₂
PN-StatisticalValidation-Full | Statistics | Wilcoxon pipeline, bootstrap CI, multiple comparison
PN-LLM-Models-PUMA | Model catalog | Llama, Mistral, Phi, Gemma, DeepSeek, GPT-4o, Claude
PN-Reflexion-SelfCritique | Agent pattern | Verbal reinforcement loop + PUMA implementation
PN-HITL-BoundedAutonomy | Ethics/Safety | Bounded autonomy framework + EU AI Act context
PN-ComputationalSustainability | Sustainability | CodeCarbon integration + CO₂eq reporting
PN-UniquenessTrap | PM theory | RCF + PUMA thesis framing
PN-FineTuning-LoRA-Quantization | LLM training | LoRA, QLoRA, GGUF quantization reference
PN-AlgorithmicBias | Ethics/Fairness | Bias taxonomy + detection + mitigation for PUMA
PN-COSTAR-SelfConsistency | Prompting | CO-STAR + CoT + Self-Consistency + SPR strategies
PN-GenerativeAgents-Simulacra | Agent architecture | Memory stream + reflection + planning for SmartPMO

What Was Missing (and Is Now Covered)

Critical Gaps Filled

1. Evaluation Metrics (previously: only informal mentions scattered across experiment notes) → Now: PN-Evaluation-Metrics-Comprehensive with formal definitions, Python code, PUMA applicability table, and metric selection guide

2. Statistical Validation (previously: no systematic treatment) → Now: PN-StatisticalValidation-Full with normality testing, Wilcoxon pipeline, bootstrap CI, multiple comparison correction, and a complete reporting template

3. Model Technical Reference (previously: model names mentioned but no specifications) → Now: PN-LLM-Models-PUMA with architecture details, quantization options, VRAM requirements, Ollama identifiers, and selection rationale

4. Prompting Frameworks (previously: techniques used but not systematically documented) → Now: PN-COSTAR-SelfConsistency with full CO-STAR template, CoT/Few-Shot/Self-Consistency comparison, structured output strategies, and prompt anti-pattern checklist

5. Uniqueness Trap Theory (previously: referenced in Flyvbjerg note but no permanent synthesis) → Now: PN-UniquenessTrap with RCF algorithm, PUMA few-shot mapping, and Ch.1 thesis framing

6. AI Safety/HITL Theoretical Basis (previously: mentioned but not formally documented) → Now: PN-HITL-BoundedAutonomy with bounded autonomy framework, mesa-optimization risk, EU AI Act mapping, and PUMA design protocol

7. Carbon Footprint Tracking (previously: not documented) → Now: PN-ComputationalSustainability with CodeCarbon integration, emission formulas, hardware baselines, and cloud vs. local comparison

8. Agent Self-Improvement (previously: no dedicated note on Reflexion pattern) → Now: PN-Reflexion-SelfCritique + LN-Shinn-2023-Reflexion with full architecture, ablation results, and a Python implementation

9. Generative Agents Memory Architecture (previously: not documented) → Now: PN-GenerativeAgents-Simulacra + LN-Park-2023-GenerativeAgents with memory stream scoring, reflection, and SmartPMO mapping

10. PM Theory Context (previously: Flyvbjerg paper note but no permanent synthesis) → Now: PN-UniquenessTrap explicitly connecting Kahneman planning fallacy → Flyvbjerg uniqueness trap → PUMA RCF implementation
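
The bootstrap CI and multiple-comparison steps named in item 2 can be sketched as follows. This is a minimal, standard-library-only illustration and not the code stored in PN-StatisticalValidation-Full: the function names `bootstrap_ci` and `holm_adjust`, the choice of the Holm correction, and the sample differences are all assumptions made for this example.

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Illustrative per-fold F1 differences (model A minus model B), not PUMA results.
    diffs = [0.02, 0.05, -0.01, 0.04, 0.03, 0.06, 0.00, 0.02]
    print("95% CI for mean difference:", bootstrap_ci(diffs))
    print("Holm-adjusted p-values:", holm_adjust([0.01, 0.04, 0.03]))
```

If the interval excludes zero and the adjusted p-value stays below the chosen significance level, the two checks agree; the Wilcoxon test itself is left to the existing note.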


Remaining Gaps

High Priority (Required for PUMA thesis)

All High Priority Gaps Resolved — May 2026

Gap | Status | Created
PN-ContextEngineering | ✅ Done | 2026-05-01
PN-MCP-ModelContextProtocol | ✅ Done | 2026-05-01
PN-TreeOfThoughts-Deliberate | ✅ Done | 2026-05-01
PN-RLHF-Constitutional | ✅ Done | 2026-05-01
PN-Transformer-MoE | ✅ Done | 2026-05-01

Medium Priority (Enriches thesis)

Gap | Impact | Suggested Action
LN-Confidence-AcquisitionModel | Active learning/label correction for TAWOS quality | Create from DOI 10.1162/tacl_a_00734
PN-StatisticalValidation-Full | Spearman ρ implementation details missing | Extend existing note
Book: Frictionless | Organizational friction → PM automation motivation | Create LN-Frictionless-book.md
Dataset: TAWOS deep-dive | Dataset-specific note is shallow | Expand LN-TAWOS-Dataset.md

Low Priority (Future work / optional)

Gap | Topic
RSA paper (Rivest 1978) | Only relevant if PUMA addresses authentication security topics
Zero-Knowledge Proofs (Quisquater 1990) | Same: only if cryptographic privacy is in scope
Conway 1976 | Game theory / combinatorics; not directly PUMA-relevant
LN-CodeRabbit-2025 | Industry report on AI code review; Ch.2 background
LN-CodeLogician | Tool note; only if PUMA integrates static analysis

Proposed New Research Directions

Based on the knowledge expansion, these are the most promising directions not currently covered in PUMA’s experimental scope:

1. Fine-Tuning Track (Stage 6)

PUMA currently evaluates frozen models. The PN-FineTuning-LoRA-Quantization note lays the groundwork for a fine-tuning experiment:

  • Research question: Does domain-adapted QLoRA fine-tuning on TAWOS data improve Macro-F1 by more than 10% over few-shot prompting at an acceptable training cost? (The threshold still needs to be fixed as relative or absolute.)
  • Feasibility: QLoRA on an 8B model requires roughly 1 GPU-day, which is within TFG scope
  • Metrics: Same H1/H2 metrics + carbon cost comparison vs. prompting-only
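
To make the research question testable, the Macro-F1 comparison can be sketched as below. This is an illustrative, dependency-free sketch: `macro_f1`, `absolute_gain`, and `relative_gain` are hypothetical helper names, the labels are toy data rather than TAWOS records, and both readings of the 10% threshold are computed.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def absolute_gain(baseline_f1, tuned_f1):
    """Improvement in F1 points (0.10 means ten percentage points)."""
    return tuned_f1 - baseline_f1


def relative_gain(baseline_f1, tuned_f1):
    """Improvement as a fraction of the baseline score."""
    return (tuned_f1 - baseline_f1) / baseline_f1


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = ["bug", "bug", "task", "task"]
    few_shot = ["bug", "task", "task", "task"]
    fine_tuned = ["bug", "bug", "task", "task"]
    base, tuned = macro_f1(y_true, few_shot), macro_f1(y_true, fine_tuned)
    print(f"absolute gain: {absolute_gain(base, tuned):.3f}")
    print(f"relative gain: {relative_gain(base, tuned):.3f}")
```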

2. Active Learning Loop (HITL + Label Correction)

The Confidence-Acquisition Model paper suggests combining HITL confidence gates with active learning:

  • Route low-confidence predictions to human review
  • Use human corrections as labeled training data for future fine-tuning
  • Track label quality improvement over time (MTTD reduction)
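
The routing and correction steps above might look like the following sketch. The `Prediction` record, the 0.7 threshold, and the function names are assumptions for illustration; the source does not specify how PUMA represents confidence or stores human corrections.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    issue_id: str
    label: str
    confidence: float  # assumed to be a score in [0, 1]


def route(predictions, threshold=0.7):
    """Split predictions into auto-accepted and human-review queues."""
    auto, review = [], []
    for pred in predictions:
        (auto if pred.confidence >= threshold else review).append(pred)
    return auto, review


def corrections_to_training_pairs(review_queue, corrections):
    """Pair each reviewed issue with the label the human reviewer assigned.

    `corrections` maps issue_id -> corrected label; issues the reviewer
    has not handled yet are skipped.
    """
    return [
        (pred.issue_id, corrections[pred.issue_id])
        for pred in review_queue
        if pred.issue_id in corrections
    ]


if __name__ == "__main__":
    # Hypothetical issue IDs and scores.
    preds = [Prediction("T-1", "bug", 0.92), Prediction("T-2", "task", 0.41)]
    auto, review = route(preds)
    pairs = corrections_to_training_pairs(review, {"T-2": "bug"})
    print(len(auto), "auto-accepted;", pairs, "queued as training data")
```

Keeping the threshold as a parameter lets the share of human-reviewed items be tuned against reviewer capacity.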

3. Ecological Validity Study

The LN-Collaborating-AIAgents-2025 paper shows that field experiments provide evidence that lab studies (like PUMA) cannot. A follow-up study could:

  • Deploy PUMA triage agent in a real software team’s Jira instance
  • Measure MTTR, sprint accuracy, developer satisfaction
  • Compare against PUMA lab results to quantify the lab-to-field gap

4. Multi-Project Generalization

PUMA uses TAWOS (specific project repositories). Key open question:

  • Does a model calibrated on Apache/JIRA projects generalize to enterprise Jira instances?
  • Experiment: Train on 80% of TAWOS projects, test on held-out 20% from different domains
  • Metric: Domain adaptation gap = F1 degradation from in-domain to out-of-domain
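
A project-level hold-out and the gap metric could be sketched like this. It is a simplified illustration: `split_projects` and `domain_gap` are hypothetical names, and the split is random over project IDs, whereas the proposed experiment would also need the held-out projects to come from different domains.

```python
import random


def split_projects(project_ids, test_fraction=0.2, seed=42):
    """Hold out whole projects so no project appears in both splits."""
    projects = sorted(set(project_ids))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(projects)
    n_test = max(1, round(len(projects) * test_fraction))
    return set(projects[n_test:]), set(projects[:n_test])


def domain_gap(f1_in_domain, f1_out_of_domain):
    """F1 degradation when moving from in-domain to out-of-domain data."""
    return f1_in_domain - f1_out_of_domain


if __name__ == "__main__":
    # Placeholder project IDs, not actual TAWOS project keys.
    projects = [f"PROJ-{i}" for i in range(10)]
    train, test = split_projects(projects)
    print(len(train), "training projects;", len(test), "held-out projects")
```

Splitting by project rather than by issue avoids leakage of project-specific vocabulary into the test set.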

5. Carbon-Quality Pareto Frontier

Using PN-ComputationalSustainability data:

  • Map each model configuration to (F1, CO₂eq) coordinates
  • Identify Pareto-optimal configurations (best quality for given carbon budget)
  • Deliverable: A deployment decision matrix for organizations with sustainability constraints
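
Identifying the Pareto-optimal configurations can be sketched as below, assuming each configuration is summarized by a single F1 score and a single CO₂eq figure. The function name, the tuple layout, and the example values are placeholders and not measured PUMA results.

```python
def pareto_front(configs):
    """Return configurations not dominated on (higher F1, lower CO2eq).

    `configs` is a list of (name, f1, co2eq) tuples. A configuration is
    dominated if another one is at least as good on both axes and
    strictly better on at least one.
    """
    front = []
    for name, f1, co2 in configs:
        dominated = any(
            other_f1 >= f1 and other_co2 <= co2
            and (other_f1 > f1 or other_co2 < co2)
            for _, other_f1, other_co2 in configs
        )
        if not dominated:
            front.append((name, f1, co2))
    # Sort by carbon cost so the result reads as a budget ladder.
    return sorted(front, key=lambda cfg: cfg[2])


if __name__ == "__main__":
    # Placeholder configurations for illustration only.
    configs = [
        ("config-a", 0.70, 1.0),
        ("config-b", 0.75, 2.0),
        ("config-c", 0.72, 3.0),  # worse than config-b on both axes
        ("config-d", 0.80, 5.0),
    ]
    for name, f1, co2 in pareto_front(configs):
        print(name, f1, co2)
```

Reading the sorted front from top to bottom gives the best achievable F1 for each increasing carbon budget, which is the shape a deployment decision matrix needs.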

Architectural Improvement Suggestions

Prompt Versioning System

Currently, prompts are hardcoded in experiment notebooks. Recommended folder structure:

40 - Projects/PUMA/41.8 Prompts/
  v1.0-h1-triage-zero-shot.md
  v1.0-h1-triage-few-shot-cot.md
  v2.0-h1-triage-few-shot-costar.md
  changelog.md

Each prompt version gets a semantic version number and a changelog entry, which enables reproducibility across experiments.
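
One way an experiment notebook could resolve prompts from this layout is sketched below. The filename pattern is inferred from the three example files (major.minor version, hypothesis tag, variant name); the helper names `parse_prompt_filename` and `latest_prompt` are hypothetical, and a full semantic version would also carry a patch number.

```python
import re
from typing import NamedTuple, Optional


class PromptVersion(NamedTuple):
    major: int
    minor: int
    hypothesis: str
    variant: str


# Assumed naming scheme: v<major>.<minor>-<hypothesis>-<variant>.md
_PATTERN = re.compile(r"^v(\d+)\.(\d+)-(h\d+)-(.+)\.md$")


def parse_prompt_filename(name: str) -> Optional[PromptVersion]:
    """Parse a prompt filename; return None for non-prompt files."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    major, minor, hypothesis, variant = match.groups()
    return PromptVersion(int(major), int(minor), hypothesis, variant)


def latest_prompt(filenames, hypothesis):
    """Return the filename with the highest version for a hypothesis."""
    candidates = []
    for name in filenames:
        parsed = parse_prompt_filename(name)
        if parsed is not None and parsed.hypothesis == hypothesis:
            candidates.append(((parsed.major, parsed.minor), name))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    files = [
        "v1.0-h1-triage-zero-shot.md",
        "v1.0-h1-triage-few-shot-cot.md",
        "v2.0-h1-triage-few-shot-costar.md",
        "changelog.md",
    ]
    print(latest_prompt(files, "h1"))
```

Logging the resolved filename with each experiment run would tie results to an exact prompt version.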

Metrics Registry

Metric definitions are currently fragmented across experiment notes and need a single source of truth. The PN-Evaluation-Metrics-Comprehensive note serves this purpose but should be linked from all experiment files.

Experiment-to-Paper Traceability Matrix

A table mapping each experiment (EX-H1-ZeroShot, EX-H2-FewShot, etc.) to the thesis section it supports (Ch.3, Ch.4, Ch.5). This mapping is currently implicit.
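
Such a matrix could be kept as a small data structure and checked automatically. In the sketch below the experiment names come from the text, but the chapter assignments are placeholders chosen only to exercise the code, and the helper names are hypothetical.

```python
# Placeholder mapping: the chapter assignments are NOT taken from the thesis.
TRACEABILITY = {
    "EX-H1-ZeroShot": ["Ch.4"],
    "EX-H2-FewShot": [],  # not yet assigned to a thesis section
}


def untraced_experiments(matrix):
    """Experiments that do not yet support any thesis section."""
    return sorted(exp for exp, sections in matrix.items() if not sections)


def experiments_for_section(matrix, section):
    """All experiments that support a given thesis section."""
    return sorted(exp for exp, sections in matrix.items() if section in sections)


if __name__ == "__main__":
    print("Untraced:", untraced_experiments(TRACEABILITY))
    print("Ch.4 evidence:", experiments_for_section(TRACEABILITY, "Ch.4"))
```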


MOCs Updated

  • MOC-LLM-Benchmarks-PM-AI: Added 9 new paper links + 13 new permanent note links
  • MOC-Methods-Frameworks: update pending (add Reflexion, RCF, and Deceptive Alignment sections)