Knowledge Coverage Report — PUMA Vault Expansion (April 2026)

Purpose

This report documents the systematic knowledge expansion of the PUMA vault conducted in April 2026, with a follow-up update in May 2026. It identifies what was missing, what was added, which gaps remain, and which research directions are proposed.

Executive Summary

Metric	Before Expansion	After Expansion (Apr)	After Expansion (May)	Delta Total
Literature notes (LN-)	~95	105	105	+10
Permanent notes (PN-)	~15	26	31	+16
Book notes	0	4	4	+4
MOC entries updated	—	2	3	—

May 2026 Update

Five high-priority permanent notes were created in May 2026: PN-ContextEngineering, PN-MCP-ModelContextProtocol, PN-TreeOfThoughts-Deliberate, PN-RLHF-Constitutional, and PN-Transformer-MoE. MOC-Methods-Frameworks was updated with all five entries.

What Was Added

New Literature Notes

File	Topic	Key Contribution to PUMA
`LN-Shinn-2023-Reflexion`	Agent self-improvement	Reflexion loop → Stage 4 iterative triage
`LN-Liu-2023-AgentBench`	Agent benchmarking	SPR metric; open-source model capability gap
`LN-Park-2023-GenerativeAgents`	Memory-augmented agents	SmartPMO persistent memory architecture
`LN-Xie-2023-OpenAgents`	Agent triads	Data Agent → TAWOS processing pipeline
`LN-Hubinger-2019-LearnedOptimization`	AI safety	Inner alignment → HITL theoretical basis
`LN-Incident-Management-AI-2023`	AIOps survey	MTTD/MTTR metrics; triage lifecycle positioning
`LN-AssistGUI-2023`	GUI automation	Stage 5 Jira integration without REST API
`LN-Collaborating-AIAgents-2025`	Field experiments	Ecological validity for augmentation debate
`LN-LLM-MAS-SE-2024`	MAS for SE	PM agent taxonomy; evaluation gap PUMA fills
`LN-Flyvbjerg-2023-UniquenessTrap`	PM theory	Uniqueness Trap → PUMA’s motivating problem

New Book Notes

File	Author	Key Contribution to PUMA
`LN-Beck-1999-XPExplained`	Kent Beck	Story points origin; planning game history
`LN-Lawrence-2024-AtomicHuman`	Neil D. Lawrence	Atomic human concept; HITL theoretical framing
`LN-Suleiman-2023-ComingWave`	Mustafa Suleyman	Macro AI policy context for ethics chapter
`LN-Shum-2025-PensarConPrompts`	Jan Shum	CO-STAR framework; prompt engineering taxonomy

New Permanent Notes

File	Topic	Scope
`PN-Evaluation-Metrics-Comprehensive`	All PUMA metrics	F1, MAE, SA, SPR, Wilcoxon, CI, BLEU, ROUGE, CO₂
`PN-StatisticalValidation-Full`	Statistics	Wilcoxon pipeline, bootstrap CI, multiple comparison
`PN-LLM-Models-PUMA`	Model catalog	Llama, Mistral, Phi, Gemma, DeepSeek, GPT-4o, Claude
`PN-Reflexion-SelfCritique`	Agent pattern	Verbal reinforcement loop + PUMA implementation
`PN-HITL-BoundedAutonomy`	Ethics/Safety	Bounded autonomy framework + EU AI Act context
`PN-ComputationalSustainability`	Sustainability	CodeCarbon integration + CO₂eq reporting
`PN-UniquenessTrap`	PM theory	RCF + PUMA project framing
`PN-FineTuning-LoRA-Quantization`	LLM training	LoRA, QLoRA, GGUF quantization reference
`PN-AlgorithmicBias`	Ethics/Fairness	Bias taxonomy + detection + mitigation for PUMA
`PN-COSTAR-SelfConsistency`	Prompting	CO-STAR + CoT + Self-Consistency + SPR strategies
`PN-GenerativeAgents-Simulacra`	Agent architecture	Memory stream + reflection + planning for SmartPMO

What Was Missing (and Is Now Covered)

Critical Gaps Filled

1. Evaluation Metrics (previously: only informal mentions scattered across experiment notes) → Now: PN-Evaluation-Metrics-Comprehensive with formal definitions, Python code, PUMA applicability table, and metric selection guide

2. Statistical Validation (previously: no systematic treatment) → Now: PN-StatisticalValidation-Full with normality testing, Wilcoxon pipeline, bootstrap CI, multiple comparison correction, and a complete reporting template

3. Model Technical Reference (previously: model names mentioned but no specifications) → Now: PN-LLM-Models-PUMA with architecture details, quantization options, VRAM requirements, Ollama identifiers, and selection rationale

4. Prompting Frameworks (previously: techniques used but not systematically documented) → Now: PN-COSTAR-SelfConsistency with full CO-STAR template, CoT/Few-Shot/Self-Consistency comparison, structured output strategies, and prompt anti-pattern checklist

5. Uniqueness Trap Theory (previously: referenced in Flyvbjerg note but no permanent synthesis) → Now: PN-UniquenessTrap with RCF algorithm, PUMA few-shot mapping, and Ch.1 project framing

6. AI Safety/HITL Theoretical Basis (previously: mentioned but not formally documented) → Now: PN-HITL-BoundedAutonomy with bounded autonomy framework, mesa-optimization risk, EU AI Act mapping, and PUMA design protocol

7. Carbon Footprint Tracking (previously: not documented) → Now: PN-ComputationalSustainability with CodeCarbon integration, emission formulas, hardware baselines, and cloud vs. local comparison

8. Agent Self-Improvement (previously: no dedicated note on Reflexion pattern) → Now: PN-Reflexion-SelfCritique + LN-Shinn-2023-Reflexion with full architecture, ablation results, Python implementation

9. Generative Agents Memory Architecture (previously: not documented) → Now: PN-GenerativeAgents-Simulacra + LN-Park-2023-GenerativeAgents with memory stream scoring, reflection, SmartPMO mapping

10. PM Theory Context (previously: Flyvbjerg paper note but no permanent synthesis) → Now: PN-UniquenessTrap explicitly connecting Kahneman planning fallacy → Flyvbjerg uniqueness trap → PUMA RCF implementation
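
The bootstrap CI named in item 2 can be illustrated with a short standard-library sketch. This is not the code from PN-StatisticalValidation-Full; the function name `bootstrap_ci`, its defaults, and the sample values are assumptions made only for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical paired differences in absolute error (baseline minus candidate).
diffs = [0.4, 0.1, -0.2, 0.6, 0.3, 0.0, 0.5, 0.2, -0.1, 0.4]
low, high = bootstrap_ci(diffs)
print(f"mean diff = {mean(diffs):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the difference is unlikely to be resampling noise. A percentile interval is the simplest variant, and other bootstrap intervals may suit small samples better.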

Remaining Gaps

High Priority (Required for PUMA project)

All High Priority Gaps Resolved — May 2026

Gap	Status	Created
`PN-ContextEngineering`	✅ Done	2026-05-01
`PN-MCP-ModelContextProtocol`	✅ Done	2026-05-01
`PN-TreeOfThoughts-Deliberate`	✅ Done	2026-05-01
`PN-RLHF-Constitutional`	✅ Done	2026-05-01
`PN-Transformer-MoE`	✅ Done	2026-05-01

Medium Priority (Enriches the project)

Gap	Impact	Suggested Action
`LN-Confidence-AcquisitionModel`	Active learning/label correction for TAWOS quality	Create from DOI 10.1162/tacl_a_00734
`PN-StatisticalValidation-Full`	Spearman ρ implementation details missing	Extend existing note
Book: Frictionless	Organizational friction → PM automation motivation	Create `LN-Frictionless-book.md`
Dataset: TAWOS deep-dive	Dataset-specific note is shallow	Expand `LN-TAWOS-Dataset.md`

Low Priority (Future work / optional)

Gap	Topic
RSA paper (Rivest 1978)	Only relevant if PUMA addresses authentication security topics
Zero-Knowledge Proofs (Quisquater 1990)	Same — only if cryptographic privacy is in scope
Conway 1976	Game theory / combinatorics — not directly PUMA-relevant
`LN-CodeRabbit-2025`	Industry report on AI code review — Ch.2 background
`LN-CodeLogician`	Tool note — only if PUMA integrates static analysis

Proposed New Research Directions

Based on the knowledge expansion, these are the most promising directions not currently covered in PUMA’s experimental scope:

1. Fine-Tuning Track (Stage 6)

PUMA currently evaluates frozen models. The PN-FineTuning-LoRA-Quantization note lays the groundwork for a fine-tuning experiment:

Research question: Does domain-adapted QLoRA fine-tuning on TAWOS data improve Macro-F1 by >10% over few-shot prompting, at acceptable training cost?
Feasibility: QLoRA on an 8B model requires roughly 1 GPU-day, which is within project scope
Metrics: Same H1/H2 metrics + carbon cost comparison vs. prompting-only
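
The research question turns on a Macro-F1 comparison, which can be sketched in plain Python. The labels and predictions below are invented placeholders, not PUMA results. Because the question does not say whether ">10%" is absolute or relative, the sketch reports both.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain(baseline, candidate):
    """Relative improvement of candidate over baseline (0.10 means +10%)."""
    return (candidate - baseline) / baseline


# Placeholder data for illustration only.
y_true = ["bug", "bug", "feature", "feature", "task", "task"]
few_shot = ["bug", "feature", "feature", "task", "task", "task"]
fine_tuned = ["bug", "bug", "feature", "task", "task", "task"]

f1_few_shot = macro_f1(y_true, few_shot)
f1_fine_tuned = macro_f1(y_true, fine_tuned)
print(f"absolute gain: {f1_fine_tuned - f1_few_shot:.3f}")
print(f"relative gain: {relative_gain(f1_few_shot, f1_fine_tuned):.1%}")
```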

2. Active Learning Loop (HITL + Label Correction)

The Confidence-Acquisition Model paper suggests combining HITL confidence gates with active learning:

Route low-confidence predictions to human review
Use human corrections as labeled training data for future fine-tuning
Track label quality improvement over time (MTTD reduction)
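
The routing and correction steps above can be sketched as follows. The `Prediction` class, the 0.8 threshold, and the issue identifiers are hypothetical; the Confidence-Acquisition Model paper may define the gate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    issue_id: str
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route_by_confidence(predictions, threshold=0.8):
    """Split predictions into auto-accepted and human-review queues."""
    accepted, needs_review = [], []
    for pred in predictions:
        target = accepted if pred.confidence >= threshold else needs_review
        target.append(pred)
    return accepted, needs_review


def collect_corrections(needs_review, human_labels):
    """Keep only reviewed items that received a human label."""
    return [
        (pred.issue_id, human_labels[pred.issue_id])
        for pred in needs_review
        if pred.issue_id in human_labels
    ]


# Hypothetical predictions and reviewer labels.
predictions = [
    Prediction("ISSUE-1", "bug", 0.95),
    Prediction("ISSUE-2", "feature", 0.55),
    Prediction("ISSUE-3", "task", 0.81),
    Prediction("ISSUE-4", "bug", 0.40),
]
accepted, needs_review = route_by_confidence(predictions)
human_labels = {"ISSUE-2": "feature", "ISSUE-4": "task"}
training_examples = collect_corrections(needs_review, human_labels)
print(training_examples)
```

The collected pairs are the labeled data that a later fine-tuning run could consume. The threshold itself would need calibration against held-out data.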

3. Ecological Validity Study

The LN-Collaborating-AIAgents-2025 paper shows that field experiments provide evidence that lab studies (like PUMA) cannot. A follow-up study would:

Deploy PUMA triage agent in a real software team’s Jira instance
Measure MTTR, sprint accuracy, developer satisfaction
Compare against PUMA lab results — quantify the lab-to-field gap

4. Multi-Project Generalization

PUMA uses TAWOS (specific project repositories). Key open question:

Does a model calibrated on Apache/JIRA projects generalize to enterprise Jira instances?
Experiment: Train on 80% of TAWOS projects, test on held-out 20% from different domains
Metric: Domain adaptation gap = F1 degradation from in-domain to out-of-domain
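
A project-level hold-out and the gap metric might look like the sketch below. The project names and F1 values are placeholders. The split is random by project, so holding out projects "from different domains" would additionally need a domain label per project, which is not modeled here.

```python
import random


def split_projects(projects, test_fraction=0.2, seed=0):
    """Hold out whole projects so no project appears on both sides."""
    rng = random.Random(seed)
    shuffled = sorted(projects)  # sort first so the seed fully fixes the order
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def domain_adaptation_gap(f1_in_domain, f1_out_of_domain):
    """F1 degradation from in-domain to out-of-domain evaluation."""
    return f1_in_domain - f1_out_of_domain


# Placeholder project identifiers, not actual TAWOS project names.
projects = [f"PROJECT-{i:02d}" for i in range(10)]
train_projects, test_projects = split_projects(projects)
print("held out:", test_projects)

# Placeholder scores for illustration only.
print(f"gap = {domain_adaptation_gap(0.70, 0.58):.2f}")
```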

5. Carbon-Quality Pareto Frontier

Using PN-ComputationalSustainability data:

Map each model configuration to (F1, CO₂eq) coordinates
Identify Pareto-optimal configurations (best quality for given carbon budget)
Deliverable: A deployment decision matrix for organizations with sustainability constraints
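
Identifying the Pareto-optimal configurations comes down to a dominance check, sketched below. The configuration names and the (F1, CO₂eq) values are invented for illustration and do not come from PN-ComputationalSustainability.

```python
def pareto_front(configs):
    """Return configs not dominated by any other, sorted by CO2eq.

    Each config is (name, f1, co2eq). Config X dominates Y when X has
    F1 at least as high and CO2eq at least as low, and is strictly
    better on at least one of the two.
    """
    front = []
    for name, f1, co2 in configs:
        dominated = any(
            (other_f1 >= f1 and other_co2 <= co2)
            and (other_f1 > f1 or other_co2 < co2)
            for _, other_f1, other_co2 in configs
        )
        if not dominated:
            front.append((name, f1, co2))
    return sorted(front, key=lambda c: c[2])


# Placeholder values for illustration only.
configs = [
    ("config-A", 0.62, 1.0),
    ("config-B", 0.70, 2.5),
    ("config-C", 0.68, 3.0),  # dominated by config-B
    ("config-D", 0.75, 6.0),
    ("config-E", 0.60, 1.5),  # dominated by config-A
]
for name, f1, co2 in pareto_front(configs):
    print(f"{name}: F1={f1:.2f}, CO2eq={co2:.1f}")
```

Reading the front from lowest to highest CO₂eq gives the best achievable F1 at each carbon budget, which is the shape a deployment decision matrix would take.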

Architectural Improvement Suggestions

Prompt Versioning System

Prompts are currently hardcoded in experiment notebooks. A versioned prompt directory is recommended:

40 - Projects/PUMA/41.8 Prompts/
  v1.0-h1-triage-zero-shot.md
  v1.0-h1-triage-few-shot-cot.md
  v2.0-h1-triage-few-shot-costar.md
  changelog.md

Each prompt version gets a semantic version number and changelog entry — enabling reproducibility across experiments.
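
An experiment notebook could select a prompt by version with a helper like the one below. It is a sketch: the filename pattern is inferred from the listing above, and the names `parse_prompt_filename` and `latest_for_prefix` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed pattern: v<major>.<minor>-<slug>.md, as in the listing above.
PROMPT_FILE = re.compile(r"^v(\d+)\.(\d+)-(.+)\.md$")


class PromptFile(NamedTuple):
    major: int
    minor: int
    slug: str
    filename: str


def parse_prompt_filename(filename):
    """Return a PromptFile, or None for files such as changelog.md."""
    match = PROMPT_FILE.match(filename)
    if match is None:
        return None
    major, minor, slug = match.groups()
    return PromptFile(int(major), int(minor), slug, filename)


def latest_for_prefix(filenames, prefix):
    """Highest-versioned prompt whose slug starts with `prefix`."""
    parsed = [
        p for p in map(parse_prompt_filename, filenames)
        if p is not None and p.slug.startswith(prefix)
    ]
    return max(parsed, key=lambda p: (p.major, p.minor), default=None)


files = [
    "v1.0-h1-triage-zero-shot.md",
    "v1.0-h1-triage-few-shot-cot.md",
    "v2.0-h1-triage-few-shot-costar.md",
    "changelog.md",
]
print(latest_for_prefix(files, "h1-triage").filename)
```

Comparing versions as integer tuples avoids the string-ordering trap in which "v10.0" would sort before "v2.0".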

Metrics Registry

Metric definitions are currently fragmented across experiment notes and need a single source of truth. The PN-Evaluation-Metrics-Comprehensive note serves this purpose but should be linked from all experiment files.

Experiment-to-Paper Traceability Matrix

A table mapping each experiment (EX-H1-ZeroShot, EX-H2-FewShot, etc.) to the project section it supports (Ch.3, Ch.4, Ch.5). This mapping is currently implicit.

MOCs Updated

MOC-LLM-Benchmarks-PM-AI: Added 9 new paper links + 13 new permanent note links
MOC-Methods-Frameworks: (pending update — add Reflexion, RCF, Deceptive Alignment sections)
