Benchmark - Local LLM Evaluation Framework
PUMA Research Vault
Platform for Understanding and Management with Agents — Empirical Benchmark for LLM-Assisted ICT Project Management
Overview
A unified, multi-methodology knowledge system for research on autonomous AI agents applied to software project management. This vault integrates six knowledge management paradigms, multiple research methodologies, a rich prompt library, and a custom Claude AI skill system — all organized under the Johnny Decimal numbering scheme.
Navigation
| Destination | Description |
|---|---|
| Vault Guide | Full guide — structure, methodologies, workflows, skills, and the complete file index |
| README.md | GitHub landing page |
| 00 - Home.md | Navigation hub |
Vault Sections
| Section | Purpose |
|---|---|
| 00 - Meta | Templates, dashboards, plugin configuration |
| 10 - Inbox | GTD capture point — process daily |
| 20 - Literature | Papers, books, datasets, videos, tools |
| 30 - Permanent | Zettelkasten — atomic, permanent, linked notes |
| 40 - Projects | Active PUMA chapters, specs, experiments, BMAD |
| 50 - Areas | Ongoing responsibilities — research, writing, code, ethics |
| 60 - Resources | Prompts, workflows, checklists, glossary, bibliography |
| 70 - Archive | Completed and deprecated material |
| 80 - MOC | Maps of Content — semantic navigation layer |
| 90 - GTD | Tasks, reviews, sprint boards |
Research Mission
PUMA investigates whether autonomous LLM agents can perform practical software project management tasks — issue triage and effort estimation — with accuracy and reproducibility comparable to human experts. The project uses a multi-stage empirical pipeline: SLR → artifact design → LLM agent construction → experiment on real-world datasets (TAWOS, Jira SR) → statistical validation and replication package publication.
H1 — Triage
An LLM agent using few-shot prompting achieves >75% F1 on issue type and priority classification on the TAWOS dataset.
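As a rough illustration of how the H1 threshold could be checked, the sketch below computes a macro-averaged F1 in plain Python. The hypothesis does not say which F1 averaging is used, so macro-F1 is an assumption here. The function name `macro_f1` and the toy labels are hypothetical and do not come from the PUMA codebase or the TAWOS dataset.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up issue types (not real TAWOS data).
gold = ["bug", "bug", "story", "task"]
predicted = ["bug", "story", "story", "task"]

score = macro_f1(gold, predicted)
h1_supported = score > 0.75  # H1 threshold: F1 above 75%
print(f"macro-F1 = {score:.4f}, H1 threshold met: {h1_supported}")
```

In an actual experiment, a library implementation such as `sklearn.metrics.f1_score(..., average="macro")` would likely be preferable. The hand-rolled version only makes the arithmetic visible.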
H2 — Estimation
An LLM agent using chain-of-thought reasoning achieves a Mean Relative Error ≤35% on story-point estimation on the Jira SR dataset.
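To make the H2 criterion concrete, here is a minimal sketch that assumes Mean Relative Error is defined as the mean of |actual − predicted| / actual over all issues with a positive actual estimate. That definition, the function name `mean_relative_error`, and the sample story points are illustrative assumptions, not values taken from the Jira SR dataset.

```python
def mean_relative_error(actual, predicted):
    """Mean of |actual - predicted| / actual, skipping non-positive actuals."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    # Relative error is undefined when the true value is zero, so skip those.
    pairs = [(a, p) for a, p in zip(actual, predicted) if a > 0]
    if not pairs:
        raise ValueError("no issues with a positive actual estimate")

    return sum(abs(a - p) / a for a, p in pairs) / len(pairs)


# Made-up story-point values for illustration only.
actual_points = [1, 2, 5, 8]
predicted_points = [1, 3, 5, 6]

mre = mean_relative_error(actual_points, predicted_points)
h2_supported = mre <= 0.35  # H2 threshold: MRE of at most 35%
print(f"MRE = {mre:.4f}, H2 threshold met: {h2_supported}")
```

Because relative error divides by the true value, small actual estimates dominate the mean. That is one reason to report it alongside MAE or MdAE.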
Open Science
The PUMA ecosystem is organized into five functional categories, each corresponding to a specific domain of the research cycle. All listed resources are publicly accessible and form part of the project’s transparency infrastructure, in line with Open Science principles and the project’s MIT License.
Key Links
- GitHub Repository: pumacp/PUMA
- Zotero Library: PUMA group library
- Datasets: Jira SR (Zenodo DOI: 10.5281/zenodo.5901893) · TAWOS (GitHub: SOLAR-group/TAWOS)
Other Links
- YouTube Playlist: https://www.youtube.com/@PUMACapstoneProject
- Research Discovery: https://discovery.researcher.life/my-library/reading-list/1815730
- Gemini (GEM): https://gemini.google.com/gem/1h-rxrzZagTsvX59_CGfaoDHjisJ48cz7?usp=sharing
- Perplexity Space: https://www.perplexity.ai/spaces/puma-6IpatdqAS_yOxg9j69qvAQ
- Research Rabbit: https://app.researchrabbit.ai/folder-shares/d8244f17-47f7-4f6c-a589-473876578b54
- Google Drive: https://drive.google.com/drive/folders/1TKbYhYqLIrq7liAPlSF7ztS2Bv0l7vZS?usp=sharing
Keywords
Claude (Anthropic) · GPT-4o (OpenAI) · Llama 3 (local) · Ollama · Python · Zotero · Obsidian · GitHub Actions · Semantic Scholar · Docker · Git · GitHub Pages · BrowserOS · Opencode · Jira Social Repository (Zenodo) · TAWOS (The Agile Wisdom of the Open Source) · APSTUD (TAWOS project) · MESOS (TAWOS project) · XD (TAWOS project) · Apache datasets (MESOS, XD, etc.) · CoGEE (Context-Guided Effort Estimation) · Jira Red Hat Dataset (2024) · NLBSE (Natural Language-Based Software Engineering) Challenge · SWE-bench · GAIA · AgentBench · CodeRabbit AI vs Human Report · Benchmark · Stratified sampling methods · Llama 3.2 8B · Mistral 7B · Phi-3.5 Mini · Gemma 2 9B · Gemma 4 · DeepSeek-R1 · DeepSeek-V3 · Claude 4.x (Anthropic family) · Qwen 3 (Alibaba family) · Open-weights model · Fine-tuning · LoRA (Low-Rank Adaptation) · Quantization (GGUF 4-bit) · Transformer · MoE (Mixture of Experts) · LLM Agent · Multi-agent system (MAS) · Manager Agent (Supervisor Pattern) · Agentic Swarm · RAG (Retrieval-Augmented Generation) · RAG pipelines · Docker-based reproducibility · GPU acceleration (NVIDIA stack) · Wilcoxon signed-rank test · Wilcoxon (p-value) · Shapiro-Wilk test · Spearman ρ · Bootstrap · F1-macro (Macro-F1) · F1-micro (Micro-F1) · F1-weighted · Recall · Precision · Accuracy · AUC-ROC · MAE (Mean Absolute Error) · RMSE (Root Mean Square Error) · MdAE (Median Absolute Error) · SA (Standardized Accuracy) · ROUGE (Recall-Oriented Understudy for Gisting Evaluation) · BLEU (Bilingual Evaluation Understudy) · Perplexity · Recall@k · MRR (Mean Reciprocal Rank) · Confidence Interval (CI) · Deterministic configs (seed=42, temp=0) · Constitutional AI · Human-in-the-loop (HITL) · Human validation · Bounded autonomy · CodeCarbon · Strubell methodology (CO₂ eq) · AI transparency & explainability · Role displacement analysis · Zettelkasten · PARA method · GTD (Getting Things Done) · Map of Content (MOC) · Johnny Decimal · LLM Wiki (Karpathy) · Cognitive Offloading · Veritas Framework · Automated Ticket Triage (Forethought-like systems) · Incident Triage · RCOIF (prompting framework) · CO-STAR · Chain-of-Thought (CoT) · Chain-of-thought monitoring · Zero-shot CoT · Zero-shot prompting · One-shot prompting · Few-shot prompting · Few-shot + CoT · Few-shot-3 · Contextual anchoring · Interactive Guided Exploration (IGE) · AMI · Self-reflection loops (Reflexion) · Tree-of-Thoughts (ToT) · ReAct (Reason + Act) · Prompt template · System prompt · Self-consistency prompting · Structured output prompting · IIPR · DRCA · Agentic Software Engineering · Spec-Driven Development (SDD) · Context-Driven Development (CDD) · OpenSpec · Spec Kit · BMAD (Business Module Agentic Design) · AI Swarm Architectures · CrewAI orchestration · AI-Native Infrastructure · AI-Driven Research Methodologies · Agentic Research · MIT Student Method (Working Paper 316) · Keshav’s Three-Pass Method · PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) · SLR (Systematic Literature Review) · Kitchenham Guidelines · DSR (Design Science Research) · Grounded Theory · AI Scientist · LLM-based literature review · Scientific Ideation Loop · Controlled Experimentation · Story points · Uniqueness Trap · PMBOK (Project Management Body of Knowledge) · ITIL 4 · Scrum · SLA (Service Level Agreement) · KPI (Key Performance Indicator) · Heuristic Baseline · MTTD (Mean Time to Detect) · MTTR (Mean Time to Resolve) · Overfitting · Ground truth · MIT License · Algorithmic Bias · Context window · Context engineering · MCP (Model Context Protocol) · MVP (Minimum Viable Product) · scikit-learn · pandas · scipy · matplotlib · Jupyter Notebook · VS Code · Cursor AI · LM Studio · TF-IDF + SVM (baseline classifier) · CodeLogician · AGENTS.md · RLHF (Reinforcement Learning from Human Feedback) · Claude Skills
Update Log
This vault reflects the ecosystem status as of April 2026. In accordance with the Veritas Framework's proactive disclosure principle, it will be updated with each partial delivery to reflect new tools or changes to public profiles.
PUMA Research Vault · Last updated: April 2026 · License: MIT
Vault v1.0 · April 2026 · PARA + GTD + Zettelkasten + Johnny Decimal + SDD + BMAD + Keshav + CDD · GitHub