PUMA 307

Benchmark - Local LLM Evaluation Framework

PUMA Vault

Platform for Understanding and Management with Agents — Empirical Benchmark for LLM-Assisted ICT Project Management

Overview

A unified, multi-methodology knowledge system for research on autonomous AI agents applied to software project management. This vault integrates six knowledge management paradigms, multiple research methodologies, a rich prompt library, and a custom Claude AI skill system — all organized under the Johnny Decimal numbering scheme.

Navigation

Destination	Description
Vault Guide	Full guide — structure, methodologies, workflows, skills, and the complete file index
README.md	GitHub landing page
00 - Home.md	Navigation hub

Vault Sections

Section	Purpose
00 - Meta	Templates, dashboards, plugin configuration
10 - Inbox	GTD capture point — process daily
20 - Literature	Papers, books, datasets, videos, tools
30 - Permanent	Zettelkasten — atomic, permanent, linked notes
40 - Projects	Active PUMA chapters, specs, experiments, BMAD
50 - Areas	Ongoing responsibilities — research, writing, code, ethics
60 - Resources	Prompts, workflows, checklists, glossary, bibliography
70 - Archive	Completed and deprecated material
80 - MOC	Maps of Content — semantic navigation layer
90 - GTD	Tasks, reviews, sprint boards

Research Mission

PUMA investigates whether autonomous LLM agents can perform practical software project management tasks, specifically issue triage and effort estimation, with accuracy and reproducibility comparable to human experts. The project uses a multi-stage empirical pipeline: systematic literature review (SLR) → artifact design → LLM agent construction → experiments on real-world datasets (TAWOS, Jira SR) → statistical validation and publication of a replication package.

H1 — Triage

An LLM agent using few-shot prompting achieves >75% F1 on issue type and priority classification on the TAWOS dataset.
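As a rough illustration of how the H1 threshold could be checked, the sketch below computes a macro-averaged F1 score in plain Python. H1 does not say which F1 averaging is meant; macro averaging is assumed here because F1-macro appears among the vault keywords. The function names (`macro_f1`, `supports_h1`) and the toy labels are hypothetical and are not taken from the PUMA code base or from TAWOS.

```python
from collections import Counter

H1_THRESHOLD = 0.75  # from H1: F1 must exceed 75%


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    labels = set(y_true) | set(y_pred)
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is >= 1 for every seen label
    scores = [2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels]
    return sum(scores) / len(scores)


def supports_h1(y_true, y_pred):
    """True when the macro-F1 is strictly above the H1 threshold."""
    return macro_f1(y_true, y_pred) > H1_THRESHOLD


# Toy issue-type labels, invented for illustration only (not TAWOS data)
gold = ["Bug", "Bug", "Story", "Task", "Story", "Bug"]
pred = ["Bug", "Story", "Story", "Task", "Story", "Bug"]
print(f"macro-F1 = {macro_f1(gold, pred):.3f}, supports H1: {supports_h1(gold, pred)}")
```

In a real experiment the same check would run separately for issue type and for priority, and a library implementation such as scikit-learn's `f1_score` with `average="macro"` would normally replace the hand-written function.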

H2 — Estimation

An LLM agent using chain-of-thought reasoning achieves a Mean Relative Error ≤35% on story-point estimation on the Jira SR dataset.
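The H2 criterion can be sketched in the same spirit. The code below assumes that "Mean Relative Error" means the average of |actual - estimated| / actual over all issues; the source does not spell out the formula, so this definition, the function names, and the toy story points are all assumptions made for illustration.

```python
H2_THRESHOLD = 0.35  # from H2: Mean Relative Error must be at most 35%


def mean_relative_error(actual, estimated):
    """Average of |actual - estimated| / actual (assumed definition)."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual story points must be positive")
    errors = [abs(a - e) / a for a, e in zip(actual, estimated)]
    return sum(errors) / len(errors)


def supports_h2(actual, estimated):
    """True when the mean relative error is within the H2 threshold."""
    return mean_relative_error(actual, estimated) <= H2_THRESHOLD


# Toy story points, invented for illustration only (not Jira SR data)
actual_points = [3, 5, 8, 2]
agent_points = [3, 8, 5, 2]
mre = mean_relative_error(actual_points, agent_points)
print(f"MRE = {mre:.4f}, supports H2: {supports_h2(actual_points, agent_points)}")
```

Relative error is undefined for issues with zero actual story points, so the sketch rejects them explicitly; how such issues are handled in the actual study is not stated in the source.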

Open Science

The PUMA ecosystem is organized into five functional categories, each corresponding to a specific domain within the PUMA research cycle. All listed resources are publicly accessible and form part of the project's transparency infrastructure, in line with Open Science principles and the project's commitment to the MIT License.

Key Links

GitHub Repository: pumacp/PUMA
Zotero Library: PUMA group library
Datasets: Jira SR (Zenodo DOI: 10.5281/zenodo.5901893) · TAWOS (GitHub: SOLAR-group/TAWOS)

Keywords

Claude (Anthropic) · GPT-4o (OpenAI) · Llama 3 (local) · Ollama · Python · Zotero · Obsidian · GitHub Actions · Semantic Scholar · Docker · Git · GitHub Pages · BrowserOS · Opencode · Jira Social Repository (Zenodo) · TAWOS (The Agile Wisdom of the Open Source) · APSTUD (TAWOS project) · MESOS (TAWOS project) · XD (TAWOS project) · Apache datasets (MESOS, XD, etc.) · CoGEE (Context-Guided Effort Estimation) · Jira Red Hat Dataset (2024) · NLBSE (Natural Language-Based Software Engineering) Challenge · SWE-bench · GAIA · AgentBench · CodeRabbit AI vs Human Report · Benchmark · Stratified sampling methods · Llama 3.2 8B · Mistral 7B · Phi-3.5 Mini · Gemma 2 9B · Gemma 4 · DeepSeek-R1 · DeepSeek-V3 · 4.x (Anthropic family) · Qwen 3 (Alibaba family) · Open-weights model · Fine-tuning · LoRA (Low-Rank Adaptation) · Quantization (GGUF 4-bit) · Transformer · MoE (Mixture of Experts) · LLM Agent · Multi-agent system (MAS) · Manager Agent (Supervisor Pattern) · Agentic Swarm · RAG (Retrieval-Augmented Generation) · RAG pipelines · Docker-based reproducibility · GPU acceleration (NVIDIA stack) · Wilcoxon signed-rank test · Wilcoxon (p-value) · Shapiro-Wilk test · Spearman ρ · Bootstrap · F1-macro (Macro-F1) · F1-micro (Micro-F1) · F1-weighted · Recall · Precision · Accuracy · AUC-ROC · MAE (Mean Absolute Error) · RMSE (Root Mean Square Error) · MdAE (Median Absolute Error) · SA (Standardized Accuracy) · ROUGE (Recall-Oriented Understudy for Gisting Evaluation) · BLEU (Bilingual Evaluation Understudy) · Perplexity · Recall@k · MRR (Mean Reciprocal Rank) · Confidence Interval (CI) · Deterministic configs (seed=42, temp=0) · Constitutional AI · Human-in-the-loop (HITL) · Human validation · Bounded autonomy · CodeCarbon · Strubell methodology (CO₂ eq) · AI transparency & explainability · Role displacement analysis · Zettelkasten · PARA method · GTD (Getting Things Done) · Map of Content (MOC) · Johnny Decimal · LLM Wiki (Karpathy) · Cognitive Offloading · Veritas Framework · Automated Ticket Triage (Forethought-like systems) · Incident Triage · RCOIF (prompting framework) · CO-STAR · Chain-of-Thought (CoT) · Chain-of-thought monitoring · Zero-shot CoT · Zero-shot prompting · One-shot prompting · Few-shot prompting · Few-shot + CoT · Few-shot-3 · Contextual anchoring · Interactive Guided Exploration (IGE) · AMI · Self-reflection loops (Reflexion) · Tree-of-Thoughts (ToT) · ReAct (Reason + Act) · Prompt template · System prompt · Self-consistency prompting · Structured output prompting · IIPR · DRCA · Agentic Software Engineering · Spec-Driven Development (SDD) · Context-Driven Development (CDD) · OpenSpec · Spec Kit · BMAD (Business Module Agentic Design) · AI Swarm Architectures · CrewAI orchestration · AI-Native Infrastructure · AI-Driven Research Methodologies · Agentic Research · MIT Student Method (Working Paper 316) · Keshav’s Three-Pass Method · PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) · SLR (Systematic Literature Review) · Kitchenham Guidelines · DSR (Design Science Research) · Grounded Theory · AI Scientist · LLM-based literature review · Scientific Ideation Loop · Controlled Experimentation · Story points · Uniqueness Trap · PMBOK (Project Management Body of Knowledge) · ITIL 4 · Scrum · SLA (Service Level Agreement) · KPI (Key Performance Indicator) · Heuristic Baseline · MTTD (Mean Time to Detect) · MTTR (Mean Time to Resolve) · Overfitting · Ground truth · MIT License · Algorithmic Bias · Context window · Context engineering · MCP (Model Context Protocol) · MVP (Minimum Viable Product) · scikit-learn · pandas · scipy · matplotlib · Jupyter Notebook · VS Code · Cursor AI · LM Studio · TF-IDF + SVM (baseline classifier) · CodeLogician · AGENTS.md · RLHF (Reinforcement Learning from Human Feedback) · Claude Skills

Update Log

This vault reflects the ecosystem status as of June 2026. In accordance with the proactive disclosure principle of the Veritas Framework, the vault will be updated with each partial delivery to reflect new tools or changes to public profiles.

PUMA Vault · Last updated: June 2026 · License: MIT

Vault v1.0 · June 2026 · PARA + GTD + Zettelkasten + Johnny Decimal + SDD + BMAD + Keshav + CDD · GitHub
