Knowledge Coverage Report — PUMA Vault Expansion (April 2026)
Purpose
This report documents the systematic knowledge expansion of the PUMA vault conducted in April 2026, with a follow-up update in May 2026. It identifies what was missing, what was added, the remaining gaps, and proposed research directions.
Executive Summary
| Metric | Before Expansion | After Expansion (Apr) | After Expansion (May) | Delta Total |
|---|---|---|---|---|
| Literature notes (LN-) | ~95 | 105 | 105 | +10 |
| Permanent notes (PN-) | ~15 | 26 | 31 | +16 |
| Book notes | 0 | 4 | 4 | +4 |
| MOC entries updated | — | 2 | 3 | — |
May 2026 Update
5 high-priority permanent notes created (May 2026):
PN-ContextEngineering, PN-MCP-ModelContextProtocol, PN-TreeOfThoughts-Deliberate, PN-RLHF-Constitutional, PN-Transformer-MoE. MOC-Methods-Frameworks updated with all 5 entries.
What Was Added
New Literature Notes
| File | Topic | Key Contribution to PUMA |
|---|---|---|
| LN-Shinn-2023-Reflexion | Agent self-improvement | Reflexion loop → Stage 4 iterative triage |
| LN-Liu-2023-AgentBench | Agent benchmarking | SPR metric; open-source model capability gap |
| LN-Park-2023-GenerativeAgents | Memory-augmented agents | SmartPMO persistent memory architecture |
| LN-Xie-2023-OpenAgents | Agent triads | Data Agent → TAWOS processing pipeline |
| LN-Hubinger-2019-LearnedOptimization | AI safety | Inner alignment → HITL theoretical basis |
| LN-Incident-Management-AI-2023 | AIOps survey | MTTD/MTTR metrics; triage lifecycle positioning |
| LN-AssistGUI-2023 | GUI automation | Stage 5 Jira integration without REST API |
| LN-Collaborating-AIAgents-2025 | Field experiments | Ecological validity for augmentation debate |
| LN-LLM-MAS-SE-2024 | MAS for SE | PM agent taxonomy; evaluation gap PUMA fills |
| LN-Flyvbjerg-2023-UniquenessTrap | PM theory | Uniqueness Trap → PUMA’s motivating problem |
New Book Notes
| File | Author | Key Contribution to PUMA |
|---|---|---|
| LN-Beck-1999-XPExplained | Kent Beck | Story points origin; planning game history |
| LN-Lawrence-2024-AtomicHuman | Neil D. Lawrence | Atomic human concept; HITL theoretical framing |
| LN-Suleiman-2023-ComingWave | Mustafa Suleyman | Macro AI policy context for ethics chapter |
| LN-Shum-2025-PensarConPrompts | Jan Shum | CO-STAR framework; prompt engineering taxonomy |
New Permanent Notes
| File | Topic | Scope |
|---|---|---|
| PN-Evaluation-Metrics-Comprehensive | All PUMA metrics | F1, MAE, SA, SPR, Wilcoxon, CI, BLEU, ROUGE, CO₂ |
| PN-StatisticalValidation-Full | Statistics | Wilcoxon pipeline, bootstrap CI, multiple comparison |
| PN-LLM-Models-PUMA | Model catalog | Llama, Mistral, Phi, Gemma, DeepSeek, GPT-4o, Claude |
| PN-Reflexion-SelfCritique | Agent pattern | Verbal reinforcement loop + PUMA implementation |
| PN-HITL-BoundedAutonomy | Ethics/Safety | Bounded autonomy framework + EU AI Act context |
| PN-ComputationalSustainability | Sustainability | CodeCarbon integration + CO₂eq reporting |
| PN-UniquenessTrap | PM theory | RCF + PUMA thesis framing |
| PN-FineTuning-LoRA-Quantization | LLM training | LoRA, QLoRA, GGUF quantization reference |
| PN-AlgorithmicBias | Ethics/Fairness | Bias taxonomy + detection + mitigation for PUMA |
| PN-COSTAR-SelfConsistency | Prompting | CO-STAR + CoT + Self-Consistency + SPR strategies |
| PN-GenerativeAgents-Simulacra | Agent architecture | Memory stream + reflection + planning for SmartPMO |
What Was Missing (and Is Now Covered)
Critical Gaps Filled
1. Evaluation Metrics (previously: only informal mentions scattered across experiment notes)
→ Now: PN-Evaluation-Metrics-Comprehensive with formal definitions, Python code, PUMA applicability table, and metric selection guide
2. Statistical Validation (previously: no systematic treatment)
→ Now: PN-StatisticalValidation-Full with normality testing, Wilcoxon pipeline, bootstrap CI, multiple comparison correction, and a complete reporting template
3. Model Technical Reference (previously: model names mentioned but no specifications)
→ Now: PN-LLM-Models-PUMA with architecture details, quantization options, VRAM requirements, Ollama identifiers, and selection rationale
4. Prompting Frameworks (previously: techniques used but not systematically documented)
→ Now: PN-COSTAR-SelfConsistency with full CO-STAR template, CoT/Few-Shot/Self-Consistency comparison, structured output strategies, and prompt anti-pattern checklist
5. Uniqueness Trap Theory (previously: referenced in Flyvbjerg note but no permanent synthesis)
→ Now: PN-UniquenessTrap with RCF algorithm, PUMA few-shot mapping, and Ch.1 thesis framing
6. AI Safety/HITL Theoretical Basis (previously: mentioned but not formally documented)
→ Now: PN-HITL-BoundedAutonomy with bounded autonomy framework, mesa-optimization risk, EU AI Act mapping, and PUMA design protocol
7. Carbon Footprint Tracking (previously: not documented)
→ Now: PN-ComputationalSustainability with CodeCarbon integration, emission formulas, hardware baselines, and cloud vs. local comparison
8. Agent Self-Improvement (previously: no dedicated note on Reflexion pattern)
→ Now: PN-Reflexion-SelfCritique + LN-Shinn-2023-Reflexion with full architecture, ablation results, Python implementation
9. Generative Agents Memory Architecture (previously: not documented)
→ Now: PN-GenerativeAgents-Simulacra + LN-Park-2023-GenerativeAgents with memory stream scoring, reflection, SmartPMO mapping
10. PM Theory Context (previously: Flyvbjerg paper note but no permanent synthesis)
→ Now: PN-UniquenessTrap explicitly connecting Kahneman planning fallacy → Flyvbjerg uniqueness trap → PUMA RCF implementation
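The bootstrap CI listed under Statistical Validation (item 2) can be illustrated with a minimal sketch. This is a generic percentile bootstrap for a mean, not the code from PN-StatisticalValidation-Full; the function name `bootstrap_ci`, its defaults, and the sample values are assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical paired differences between two model configurations.
diffs = [0.4, -0.1, 0.3, 0.2, 0.0, 0.5, 0.1, -0.2, 0.3, 0.2]
low, high = bootstrap_ci(diffs)
```

If the resulting interval excludes zero, the paired difference is unlikely to be resampling noise; the interval complements a Wilcoxon test rather than replacing it.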
Remaining Gaps
High Priority (Required for PUMA thesis)
All High Priority Gaps Resolved — May 2026
| Gap | Status | Created |
|---|---|---|
| PN-ContextEngineering | ✅ Done | 2026-05-01 |
| PN-MCP-ModelContextProtocol | ✅ Done | 2026-05-01 |
| PN-TreeOfThoughts-Deliberate | ✅ Done | 2026-05-01 |
| PN-RLHF-Constitutional | ✅ Done | 2026-05-01 |
| PN-Transformer-MoE | ✅ Done | 2026-05-01 |
Medium Priority (Enriches thesis)
| Gap | Impact | Suggested Action |
|---|---|---|
| LN-Confidence-AcquisitionModel | Active learning/label correction for TAWOS quality | Create from DOI 10.1162/tacl_a_00734 |
| PN-StatisticalValidation-Full | Spearman ρ implementation details missing | Extend existing note |
| Book: Frictionless | Organizational friction → PM automation motivation | Create LN-Frictionless-book.md |
| Dataset: TAWOS deep-dive | Dataset-specific note is shallow | Expand LN-TAWOS-Dataset.md |
Low Priority (Future work / optional)
| Gap | Topic |
|---|---|
| RSA paper (Rivest 1978) | Only relevant if PUMA addresses authentication security topics |
| Zero-Knowledge Proofs (Quisquater 1990) | Same — only if cryptographic privacy is in scope |
| Conway 1976 | Game theory / combinatorics — not directly PUMA-relevant |
| LN-CodeRabbit-2025 | Industry report on AI code review — Ch.2 background |
| LN-CodeLogician | Tool note — only if PUMA integrates static analysis |
Proposed New Research Directions
Based on the knowledge expansion, these are the most promising directions not currently covered in PUMA’s experimental scope:
1. Fine-Tuning Track (Stage 6)
PUMA currently evaluates frozen models. The PN-FineTuning-LoRA-Quantization note lays the groundwork for a fine-tuning experiment:
- Research question: Does domain-adapted QLoRA fine-tuning on TAWOS data improve Macro-F1 by >10% over few-shot prompting, at acceptable training cost?
- Feasibility: QLoRA on an 8B model requires ~1 GPU-day, which is within TFG scope
- Metrics: Same H1/H2 metrics + carbon cost comparison vs. prompting-only
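The research question above hinges on a Macro-F1 comparison, which can be sketched as follows. The sketch assumes the ">10%" threshold means relative improvement over the few-shot baseline (the source does not say whether it is relative or absolute), and the labels and predictions are invented for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain(baseline, candidate):
    """Relative improvement of `candidate` over `baseline` (assumed reading of '>10%')."""
    return (candidate - baseline) / baseline


# Hypothetical triage labels and predictions, not PUMA results.
y_true = ["bug", "bug", "feature", "task"]
few_shot_pred = ["bug", "feature", "feature", "task"]
fine_tuned_pred = ["bug", "bug", "feature", "task"]

gain = relative_gain(macro_f1(y_true, few_shot_pred), macro_f1(y_true, fine_tuned_pred))
meets_threshold = gain > 0.10
```

Whichever reading of the threshold is adopted, fixing it before running the experiment keeps the hypothesis test honest.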
2. Active Learning Loop (HITL + Label Correction)
The Confidence-Acquisition Model paper suggests combining HITL confidence gates with active learning:
- Route low-confidence predictions to human review
- Use human corrections as labeled training data for future fine-tuning
- Track label quality improvement over time (MTTD reduction)
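The routing and correction steps above can be sketched as a simple confidence gate. This is a minimal illustration, not the method of the Confidence-Acquisition Model paper; the `Prediction` type, the `route` and `collect_corrections` helpers, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    issue_id: str
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route(predictions, threshold=0.8):
    """Split predictions into auto-accepted and human-review queues."""
    auto, review = [], []
    for pred in predictions:
        (auto if pred.confidence >= threshold else review).append(pred)
    return auto, review


def collect_corrections(review_queue, human_labels):
    """Pair each reviewed issue with its human label for later fine-tuning."""
    return [
        (pred.issue_id, human_labels[pred.issue_id])
        for pred in review_queue
        if pred.issue_id in human_labels
    ]


# Hypothetical predictions and reviewer decisions.
preds = [
    Prediction("ISSUE-1", "bug", 0.95),
    Prediction("ISSUE-2", "task", 0.55),
]
auto, review = route(preds)
training_pairs = collect_corrections(review, {"ISSUE-2": "feature"})
```

The share of reviewed predictions that the human changes is one simple signal for tracking label quality over time.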
3. Ecological Validity Study
The LN-Collaborating-AIAgents-2025 paper shows that field experiments provide evidence that lab studies (like PUMA) cannot. A follow-up study would:
- Deploy PUMA triage agent in a real software team’s Jira instance
- Measure MTTR, sprint accuracy, developer satisfaction
- Compare against PUMA lab results — quantify the lab-to-field gap
4. Multi-Project Generalization
PUMA uses TAWOS (specific project repositories). Key open question:
- Does a model calibrated on Apache/JIRA projects generalize to enterprise Jira instances?
- Experiment: Train on 80% of TAWOS projects, test on held-out 20% from different domains
- Metric: Domain adaptation gap = F1 degradation from in-domain to out-of-domain
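The held-out experiment above depends on splitting by project rather than by issue, so that no test project leaks into training. A minimal sketch, assuming each issue carries a project identifier; the helper names, the seed, and the F1 values are illustrative only.

```python
import random


def split_projects(project_ids, test_fraction=0.2, seed=42):
    """Return (train_projects, test_projects) as disjoint sets of project IDs."""
    unique = sorted(set(project_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])


def adaptation_gap(f1_in_domain, f1_out_of_domain):
    """F1 degradation when moving from in-domain to out-of-domain projects."""
    return f1_in_domain - f1_out_of_domain


# Hypothetical project IDs and F1 scores.
projects = [f"PROJ-{i}" for i in range(10)]
train_projects, test_projects = split_projects(projects)
gap = adaptation_gap(0.80, 0.65)
```

A positive gap means the model loses quality on unseen projects; a gap near zero suggests the calibration transfers.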
5. Carbon-Quality Pareto Frontier
Using PN-ComputationalSustainability data:
- Map each model configuration to (F1, CO₂eq) coordinates
- Identify Pareto-optimal configurations (best quality for given carbon budget)
- Deliverable: A deployment decision matrix for organizations with sustainability constraints
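The Pareto step above can be sketched as a dominance filter over (F1, CO₂eq) pairs: a configuration is kept unless another one is at least as good on both axes and strictly better on one. The configuration names and numbers below are invented for illustration and are not PUMA measurements.

```python
def pareto_front(configs):
    """Return non-dominated (name, f1, co2eq) tuples, sorted by emissions.

    Higher f1 is better; lower co2eq is better.
    """
    front = []
    for name, f1, co2 in configs:
        dominated = any(
            other_f1 >= f1
            and other_co2 <= co2
            and (other_f1 > f1 or other_co2 < co2)
            for _, other_f1, other_co2 in configs
        )
        if not dominated:
            front.append((name, f1, co2))
    return sorted(front, key=lambda cfg: cfg[2])


# Hypothetical configurations: (name, Macro-F1, grams CO2eq per run).
configs = [
    ("config-a", 0.70, 10.0),
    ("config-b", 0.75, 20.0),
    ("config-c", 0.72, 25.0),  # dominated by config-b
    ("config-d", 0.60, 5.0),
]
front = pareto_front(configs)
```

Reading the sorted front from left to right gives the best achievable F1 for each increasing carbon budget, which is the shape a deployment decision matrix needs.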
Architectural Improvement Suggestions
Prompt Versioning System
Currently, prompts are hardcoded in experiment notebooks. Recommended layout:
    40 - Projects/PUMA/41.8 Prompts/
        v1.0-h1-triage-zero-shot.md
        v1.0-h1-triage-few-shot-cot.md
        v2.0-h1-triage-few-shot-costar.md
        changelog.md
Each prompt version gets a semantic version number and changelog entry — enabling reproducibility across experiments.
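One way to make the naming scheme machine-checkable is to parse the version and prompt slug out of each filename. The sketch below assumes the `v<major>.<minor>-<slug>.md` pattern seen in the proposed file names; the `PromptFile` type and `parse_prompt_filename` helper are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: v<major>.<minor>-<slug>.md
PROMPT_PATTERN = re.compile(r"^v(\d+)\.(\d+)-(.+)\.md$")


class PromptFile(NamedTuple):
    major: int
    minor: int
    slug: str


def parse_prompt_filename(filename: str) -> Optional[PromptFile]:
    """Return the parsed version and slug, or None for non-prompt files."""
    match = PROMPT_PATTERN.match(filename)
    if match is None:
        return None
    return PromptFile(int(match.group(1)), int(match.group(2)), match.group(3))


files = [
    "v1.0-h1-triage-zero-shot.md",
    "v1.0-h1-triage-few-shot-cot.md",
    "v2.0-h1-triage-few-shot-costar.md",
    "changelog.md",
]
prompts = [p for p in map(parse_prompt_filename, files) if p is not None]
newest = max(prompts)  # tuples compare by (major, minor, slug)
```

An experiment notebook could record the parsed version next to its results, so each reported number can be traced to the exact prompt file.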
Metrics Registry
Metric definitions are currently fragmented across experiment notes and need a single source of truth. The PN-Evaluation-Metrics-Comprehensive note serves this purpose but should be linked from all experiment files.
Experiment-to-Paper Traceability Matrix
A table mapping each experiment (EX-H1-ZeroShot, EX-H2-FewShot, etc.) to the thesis section it supports (Ch.3, Ch.4, Ch.5). This mapping is currently implicit.
MOCs Updated
- MOC-LLM-Benchmarks-PM-AI: Added 9 new paper links + 13 new permanent note links
- MOC-Methods-Frameworks: (pending update: add Reflexion, RCF, Deceptive Alignment sections)