PUMA Research Vault

Platform for Understanding and Management with Agents — Empirical Benchmark for LLM-Assisted ICT Project Management

Overview

A unified, multi-methodology knowledge system for research on autonomous AI agents applied to software project management. This vault integrates six knowledge management paradigms, multiple research methodologies, a rich prompt library, and a custom Claude AI skill system — all organized under the Johnny Decimal numbering scheme.


| Destination | Description |
| --- | --- |
| Vault Guide | Full guide: structure, methodologies, workflows, skills, and the complete file index |
| README.md | GitHub landing page |
| 00 - Home.md | Navigation hub |

Vault Sections

| Section | Purpose |
| --- | --- |
| 00 - Meta | Templates, dashboards, plugin configuration |
| 10 - Inbox | GTD capture point; process daily |
| 20 - Literature | Papers, books, datasets, videos, tools |
| 30 - Permanent | Zettelkasten: atomic, permanent, linked notes |
| 40 - Projects | Active PUMA chapters, specs, experiments, BMAD |
| 50 - Areas | Ongoing responsibilities: research, writing, code, ethics |
| 60 - Resources | Prompts, workflows, checklists, glossary, bibliography |
| 70 - Archive | Completed and deprecated material |
| 80 - MOC | Maps of Content: semantic navigation layer |
| 90 - GTD | Tasks, reviews, sprint boards |

Research Mission

PUMA investigates whether autonomous LLM agents can perform practical software project management tasks — issue triage and effort estimation — with accuracy and reproducibility comparable to human experts. The project uses a multi-stage empirical pipeline: SLR → artifact design → LLM agent construction → experiment on real-world datasets (TAWOS, Jira SR) → statistical validation and replication package publication.

H1 — Triage

An LLM agent using few-shot prompting achieves >75% F1 on issue type and priority classification on the TAWOS dataset.
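As a rough illustration of how the H1 threshold could be checked, the sketch below computes a macro-averaged F1 score in plain Python and compares it with 0.75. The choice of macro averaging, the function names `macro_f1` and `supports_h1`, and the example labels are assumptions made for illustration; the hypothesis itself only states ">75% F1".

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def supports_h1(y_true, y_pred, threshold=0.75):
    """True when the score strictly exceeds the H1 threshold."""
    return macro_f1(y_true, y_pred) > threshold


# Hypothetical issue-type labels, not taken from TAWOS.
truth = ["bug", "bug", "story", "task"]
preds = ["bug", "story", "story", "task"]
print(round(macro_f1(truth, preds), 4), supports_h1(truth, preds))
```

The same check would be run separately for issue type and for priority, since H1 covers both classification targets.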

H2 — Estimation

An LLM agent using chain-of-thought reasoning achieves a Mean Relative Error ≤35% on story-point estimation on the Jira SR dataset.
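A minimal sketch of the H2 check follows, assuming that Mean Relative Error means the average of |predicted - actual| / actual over all issues. That definition, the function name `mean_relative_error`, and the sample story points are assumptions for illustration and are not taken from the Jira SR dataset.

```python
def mean_relative_error(actual, predicted):
    """Average of |predicted - actual| / actual (assumed definition of the metric)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual story points must be positive")
    total = sum(abs(p - a) / a for a, p in zip(actual, predicted))
    return total / len(actual)


def supports_h2(actual, predicted, threshold=0.35):
    """True when the error is at or below the H2 threshold."""
    return mean_relative_error(actual, predicted) <= threshold


# Hypothetical story points for four issues.
actual_points = [3, 5, 8, 2]
predicted_points = [2, 5, 13, 3]
print(round(mean_relative_error(actual_points, predicted_points), 4),
      supports_h2(actual_points, predicted_points))
```

Because the error is divided by the actual value, issues with small story-point values weigh heavily on the result; a single misjudged 1-point story can move the mean noticeably.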


Open Science

This ecosystem is organized into five functional categories, each corresponding to a domain within the PUMA research cycle. All listed resources are publicly accessible and form part of the project’s transparency infrastructure, in line with Open Science principles and the project’s commitment to the MIT License.


  • GitHub Repository: pumacp/PUMA
  • Zotero Library: PUMA group library
  • Datasets: Jira SR (Zenodo DOI: 10.5281/zenodo.5901893) · TAWOS (GitHub: SOLAR-group/TAWOS)

  • YouTube Playlist: https://www.youtube.com/@PUMACapstoneProject
  • Research Discovery: https://discovery.researcher.life/my-library/reading-list/1815730
  • Gemini (GEM): https://gemini.google.com/gem/1h-rxrzZagTsvX59_CGfaoDHjisJ48cz7?usp=sharing
  • Perplexity Space: https://www.perplexity.ai/spaces/puma-6IpatdqAS_yOxg9j69qvAQ
  • Research Rabbit: https://app.researchrabbit.ai/folder-shares/d8244f17-47f7-4f6c-a589-473876578b54
  • Google Drive: https://drive.google.com/drive/folders/1TKbYhYqLIrq7liAPlSF7ztS2Bv0l7vZS?usp=sharing


Keywords

Claude (Anthropic) · GPT-4o (OpenAI) · Llama 3 (local) · Ollama · Python · Zotero · Obsidian · GitHub Actions · Semantic Scholar · Docker · Git · GitHub Pages · BrowserOS · Opencode · Jira Social Repository (Zenodo) · TAWOS (Tawosi Agile Web-based Open-Source) · APSTUD (TAWOS project) · MESOS (TAWOS project) · XD (TAWOS project) · Apache datasets (MESOS, XD, etc.) · CoGEE (Context-Guided Effort Estimation) · Jira Red Hat Dataset (2024) · NLBSE (Natural Language-Based Software Engineering) Challenge · SWE-bench · GAIA · AgentBench · CodeRabbit AI vs Human Report · Benchmark · Stratified sampling methods · Llama 3.2 8B · Mistral 7B · Phi-3.5 Mini · Gemma 2 9B · Gemma 4 · DeepSeek-R1 · DeepSeek-V3 · 4.x (Anthropic family) · Qwen 3 (Alibaba family) · open-weights model · Fine-tuning · LoRA (Low-Rank Adaptation) · Quantization (GGUF 4-bit) · Transformer · MoE (Mixture of Experts) · LLM Agent · Multi-agent system (MAS) · Manager Agent (Supervisor Pattern) · Agentic Swarm · RAG (Retrieval-Augmented Generation) · RAG pipelines · Docker-based reproducibility · GPU acceleration (NVIDIA stack) · Wilcoxon signed-rank test · Wilcoxon (p-value) · Shapiro-Wilk test · Spearman ρ · Bootstrap · F1-macro (Macro-F1) · F1-micro (Micro-F1) · F1-weighted · Recall · Precision · Accuracy · AUC-ROC · MAE (Mean Absolute Error) · RMSE (Root Mean Square Error) · MdAE (Median Absolute Error) · SA (Standardized Accuracy) · ROUGE (Recall-Oriented Understudy for Gisting Evaluation) · BLEU (Bilingual Evaluation Understudy) · Perplexity · Recall@k · MRR (Mean Reciprocal Rank) · r) · Confidence Interval (CI) · Deterministic configs (seed=42, temp=0) · Constitutional AI · Human-in-the-loop (HITL) · Human validation · Bounded autonomy · CodeCarbon · Strubell methodology (CO₂ eq) · AI transparency & explainability · Role displacement analysis · Zettelkasten · PARA method · GTD (Getting Things Done) · Map of Content (MOC) · Johnny Decimal · LLM Wiki (Karpathy) · Cognitive Offloading · Veritas Framework · Automated Ticket Triage (Forethought-like systems) · Incident Triage · RCOIF (prompting framework) · CO-STAR · Chain-of-Thought (CoT) · Chain-of-thought monitoring · Zero-shot CoT · Zero-shot prompting · One-shot prompting · Few-shot prompting · Few-shot + CoT · Few-shot-3 · Contextual anchoring · Interactive Guided Exploration (IGE) · AMI · Self-reflection loops (Reflexion) · Tree-of-Thoughts (ToT) · ReAct (Reason + Act) · Prompt template · System prompt · Self-consistency prompting · Structured output prompting · IIPR · DRCA · Agentic Software Engineering · Spec-Driven Development (SDD) · Context-Driven Development (CDD) · OpenSpec · Spec Kit · BMAD (Business Module Agentic Design) · AI Swarm Architectures · CrewAI orchestration · AI-Native Infrastructure · AI-Driven Research Methodologies · Agentic Research · MIT Student Method (Working Paper 316) · Keshav’s Three-Pass Method · PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) · SLR (Systematic Literature Review) · Kitchenham Guidelines · DSR (Design Science Research) · Grounded Theory · AI Scientist · H₁) · LLM-based literature review · Scientific Ideation Loop · Controlled Experimentation · Story points · Uniqueness Trap · PMBOK (Project Management Body of Knowledge) · ITIL 4 · Scrum · SLA (Service Level Agreement) · KPI (Key Performance Indicator) · Heuristic Baseline · MTTD (Mean Time to Detect) · MTTR (Mean Time to Resolve) · Overfitting · Ground truth · MIT License · Algorithmic Bias · Context window · Context engineering · MCP (Model Context Protocol) · MVP (Minimum Viable Product) · scikit-learn · pandas · scipy · matplotlib · Jupyter Notebook · VS Code · Cursor AI · LM Studio · TF-IDF + SVM (baseline classifier) · CodeLogician · AGENTS.md · RLHF (Reinforcement Learning from Human Feedback) · Claude Skills


Update Log

This vault reflects the status of the ecosystem as of April 2026. In accordance with the proactive disclosure principle of the Veritas Framework, this page will be updated with each partial delivery to reflect new tools or changes to public profiles.


PUMA Research Vault  ·  Last updated: April 2026  ·  License: MIT

Vault v1.0 · April 2026 · PARA + GTD + Zettelkasten + Johnny Decimal + SDD + BMAD + Keshav + CDD · GitHub