Local, reproducible, sustainable benchmarking for LLMs in ICT Project Management.
**F**ollowing empirical evidence, ICT project management faces triage, estimation, and learning inefficiencies. **O**bserved widely, these persist despite abundant historical data. **L**aying a rigorous foundation requires reproducible benchmarking. **L**everaging labeled datasets enables systematic evaluation of LLM performance. **O**utcomes are compared using quantitative metrics and statistical analysis. **W**ith an incremental design, a minimal viable benchmark is defined. **T**hrough open-source release, results become reproducible and verifiable. **H**ence, the framework supports extensibility across models and tasks. **E**ventually, it enables integration into real organizational settings.

**W**ithin ICT environments, recurring inefficiencies hinder effective decision-making. **H**eterogeneous data sources complicate prioritization and estimation processes. **I**n response, this work builds a reproducible LLM-based benchmark. **T**he focus is on issue triage and story-point estimation tasks. **E**valuation follows controlled experiments with statistical validation. **P**rotocols ensure reproducibility through fixed parameters and configurations. **U**sing carbon tracking, the framework measures energy impact. **M**oreover, the MVP delivers a valid and original contribution. **A**ll artefacts are released as open source for replication and extension.
What is PUMA?
PUMA is a local-first, open-source benchmarking platform that evaluates open-weight large language models (LLMs) on ICT Project Management tasks — specifically issue triage (classifying a ticket's priority) and story-point estimation (predicting effort). It runs entirely on your own hardware through Ollama: no API calls, no accounts, and no data ever leaves your machine. Every run is deterministic, so the same inputs produce byte-identical predictions, and every run is sustainability-aware — energy use and carbon are measured with CodeCarbon.
PUMA was built to test a hypothesis: that AI systems can be evaluated rigorously, and that the evaluation itself can be reproducible, auditable, and free. Around the tool sits PUMA Community, a public archive where any researcher or practitioner can publish their results with cryptographic integrity — turning a private benchmark into a shared, verifiable body of evidence.
The questions PUMA addresses
PUMA began as a set of open questions about research, software, and AI:
- Can rigorous academic research be conducted using AI tools?
- Can scientific studies conducted with AI tools be scientifically replicated?
- What is the state of the art in software development built with AI tools?
- Can software built with AI tools be efficiently audited?
- Can a local, free, automated project-management evaluation platform be built?
- Is there a paradigm shift underway in research, software development, and project management?
- What is the cost — financial and environmental — of adopting LLMs in business or project-management processes?
- Can LLM capabilities on a concrete scenario be measured scientifically, alongside their environmental consequences?
- Can gaps in the literature be detected using AI tools — can AI surface new problems and new solutions?
- Can a research project be documented in a live, automated, reproducible, real-time way?
- Can the resulting knowledge be shared publicly, without authorship, investment, or time restrictions?
- Can AI tools be studied using AI tools — can the object of study also be an instrument of its own study?
PUMA is one concrete, working answer to these questions: a reproducible artefact whose construction and evidence are public end-to-end.
The two phases of PUMA's development
PUMA was built in two openly documented phases:
1. Research. A structured literature review (Keshav's three-pass method, PRISMA 2020 review structure), methodology design (Design Science Research), hypothesis formulation, and experimental design.
2. Artefact construction. Two artefacts: the PUMA benchmark tool (this repository) and PUMA Community (the public submission hub).

Both phases are published with full reproducibility.
How PUMA works
PUMA is organized as a six-layer architecture, each layer with a single responsibility:
| Layer | Responsibility |
|---|---|
| Orchestrator | Drives a run-spec end to end (load → infer → evaluate → persist). |
| Runtime | Talks to Ollama locally, with deterministic, bounded retry. |
| Models | Discovers and describes the local model catalog (read-only). |
| Evaluation | Computes task metrics (F1-macro, MAE, calibration, latency). |
| Diagnostics | puma doctor / puma env — health and environment checks. |
| Community | Builds, validates, and publishes submissions with integrity hashes. |
Results are stored in a bi-temporal SQLite database; a Streamlit dashboard visualizes runs; and CodeCarbon records energy and emissions on every run. Crucially, execution is local-only — PUMA never makes an external inference call.
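To make the term "bi-temporal" concrete: a bi-temporal store keeps two time axes per record, when a fact holds (valid time) and when the database learned it (transaction time), so corrections never erase history. The sketch below illustrates the idea with Python's built-in sqlite3 module. The table, its columns, and the sample values are hypothetical and are not taken from PUMA's actual schema.

```python
import sqlite3

# Hypothetical schema, for illustration only (not PUMA's actual tables).
SCHEMA = """
CREATE TABLE metric (
    run_id        TEXT NOT NULL,
    name          TEXT NOT NULL,
    value         REAL NOT NULL,
    valid_from    TEXT NOT NULL,  -- valid time: when the value applies
    recorded_at   TEXT NOT NULL,  -- transaction time: when the row was written
    superseded_at TEXT            -- NULL while the row is the current belief
);
"""

def record(conn, run_id, name, value, valid_from, now):
    """Append a new version; older versions are closed, never deleted."""
    conn.execute(
        "UPDATE metric SET superseded_at = ? "
        "WHERE run_id = ? AND name = ? AND superseded_at IS NULL",
        (now, run_id, name),
    )
    conn.execute(
        "INSERT INTO metric VALUES (?, ?, ?, ?, ?, NULL)",
        (run_id, name, value, valid_from, now),
    )

def as_of(conn, run_id, name, when):
    """Return the value the database believed at transaction time `when`."""
    row = conn.execute(
        "SELECT value FROM metric WHERE run_id = ? AND name = ? "
        "AND recorded_at <= ? AND (superseded_at IS NULL OR superseded_at > ?)",
        (run_id, name, when, when),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample values: an initial score, then a later correction.
record(conn, "run-1", "f1_macro", 0.61, "2026-01-01T10:00:00", "2026-01-01T10:05:00")
record(conn, "run-1", "f1_macro", 0.63, "2026-01-01T10:00:00", "2026-01-02T09:00:00")
```

Because ISO 8601 timestamps sort lexicographically, plain string comparison is enough here. Querying `as_of` with a timestamp between the two writes returns the original value, while a later timestamp returns the correction.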
What PUMA measures
- F1-macro — issue triage (multi-class classification).
- MAE — story-point estimation (numeric prediction).
- Per-class precision / recall / F1 — where the model succeeds or fails.
- Calibration (ECE) — when log-probabilities are available.
- Latency — p50, p95, p99 per inference.
- Sustainability — grams of CO₂-equivalent and kWh, via CodeCarbon.
- Reproducibility — byte-identical predictions across repeated runs.
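For readers unfamiliar with the metrics listed above, the following is a minimal, dependency-free sketch of their textbook definitions. It is not PUMA's own evaluation code: the function names are illustrative, and details such as the ECE binning scheme (10 equal-width bins here) and the percentile method (nearest-rank here) are assumptions that the real implementation may handle differently.

```python
import math
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mae(y_true, y_pred):
    """Mean absolute error between numeric targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            total += len(b) / n * abs(accuracy - avg_conf)
    return total

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]
```

Macro averaging gives every priority class equal weight regardless of frequency, which matters for triage data where the highest-priority tickets are usually rare.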
Why use PUMA?
- 100% local — code and data never leave your machine.
- Reproducible — same inputs produce the same outputs, byte for byte.
- Sustainable — measures and reports its own environmental impact.
- Free — no API keys, no paywalls, no commercial licenses.
- Open — MIT-licensed and community-contributed.
- Multi-model — Qwen, Mistral, Llama, and Gemma families.
- Multi-hardware — CPU-only or GPU profiles, auto-detected.
- Verifiable — cryptographic integrity on every submission.
- Statistically rigorous — Wilcoxon validation and falsifiable hypotheses.
Quick start
```bash
# 1. Install via pip (available with the v4.0.0 release)
pip install puma-cp

# ...or via Docker (available with the v4.0.0 release)
docker pull ghcr.io/pumacp/puma:latest
curl -O https://raw.githubusercontent.com/pumacp/puma/main/docker-compose.yml && docker compose up

# ...or from source (works today)
git clone https://github.com/pumacp/puma && cd puma
pip install -e ".[dev]"

# 2. Check your environment (Ollama, models, hardware, database)
puma doctor

# 3. See which models are available locally
puma models list

# 4. Run your first benchmark
puma run specs/runs/baseline_triage.yaml

# 5. Package the result for sharing (local dry-run, no network)
puma share-results --dry-run --run-id <run_id> --yes
```
- Install the package and its development extras.
- puma doctor confirms Ollama is reachable, a model is present, and the hardware profile is detected.
- puma models list shows the models pulled locally.
- puma run executes a run-spec and prints Run complete: <run_id>.
- puma share-results --dry-run builds a submission package on disk without touching the network.
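Steps 4 and 5 can be chained from a script by reading the run id out of the `Run complete: <run_id>` line. The sketch below is one hedged way to do that. It assumes the message is written to standard output and that the run id contains no whitespace, neither of which is confirmed here. The helper names `extract_run_id` and `run_and_package` are hypothetical and not part of PUMA.

```python
import re
import subprocess

# Assumption: the run id is a single whitespace-free token after "Run complete:".
RUN_COMPLETE = re.compile(r"Run complete:\s*(\S+)")

def extract_run_id(output: str):
    """Return the run id from captured CLI output, or None if it is absent."""
    match = RUN_COMPLETE.search(output)
    return match.group(1) if match else None

def run_and_package(spec_path: str) -> str:
    """Run a spec, then build a local dry-run package for the resulting run."""
    result = subprocess.run(
        ["puma", "run", spec_path], capture_output=True, text=True, check=True
    )
    run_id = extract_run_id(result.stdout)  # assumes the message goes to stdout
    if run_id is None:
        raise RuntimeError("no 'Run complete: <run_id>' line found in output")
    subprocess.run(
        ["puma", "share-results", "--dry-run", "--run-id", run_id, "--yes"],
        check=True,
    )
    return run_id
```

Calling `run_and_package("specs/runs/baseline_triage.yaml")` would then mirror the two manual commands from the quick start.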
Practical tutorials
- list shows what is pulled locally; recommended shows the curated catalog and which entries you still need to ollama pull.
- submission.json + predictions.jsonl package locally for review before any external publication.
- Exit 0 means verified, exit 1 means the file does not match.
Use cases
- Academic research — rigorously evaluate model capabilities for PMO tasks and publish the findings.
- Pre-production evaluation — test which open-weight LLM fits your use case before committing to it.
- Model comparison — benchmark several models on the same task with statistical validation.
- Sustainability auditing — measure the environmental cost of AI-enabled workflows.
- Reproducibility verification — independently confirm a published result.
PUMA Community
PUMA Community is the public archive of community-contributed benchmark results.
- What it is — a public, governance-first repository of submissions.
- How it works — you open a pull request with a submission JSON; CI validates it against the schema and integrity hash, auto-merges if valid, and mirrors it to Hugging Face, Zenodo, and Kaggle.
- Why it matters — submissions are cryptographically verifiable, follow FAIR data principles, and receive citable DOIs.
- How to contribute — see the contributing guide.
- Status — browse the public leaderboard.
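The validation step described above (a schema check plus an integrity hash) can be pictured with a small sketch. It assumes the hash is a SHA-256 digest over a canonical JSON serialization of the submission, which is a common approach but is not confirmed to be PUMA Community's. The field names (`model`, `task`, `metrics`, `integrity_sha256`) are hypothetical placeholders for the real schema.

```python
import hashlib
import json

# Hypothetical field names; PUMA Community's real schema may differ.
REQUIRED = ("model", "task", "metrics")
HASH_FIELD = "integrity_sha256"

def content_hash(submission: dict) -> str:
    """SHA-256 over canonical JSON of everything except the hash field itself."""
    payload = {k: v for k, v in submission.items() if k != HASH_FIELD}
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def seal(submission: dict) -> dict:
    """Return a copy of the submission with its integrity hash attached."""
    return {**submission, HASH_FIELD: content_hash(submission)}

def validate(submission: dict) -> list:
    """Return a list of problems; an empty list means the submission passes."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in submission]
    if submission.get(HASH_FIELD) != content_hash(submission):
        problems.append("integrity hash mismatch")
    return problems
```

Sorting keys and fixing separators makes the digest independent of key order and whitespace, so two tools that serialize the same content differently still agree on the hash.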
Research with PUMA Vault
PUMA Vault is the public knowledge graph behind the research process. It is built with Obsidian using PARA, GTD, and Zettelkasten methods, linking literature notes, methodology decisions, and findings into a navigable web. Browse it at pumacp.github.io/puma-vault.
Methodologies
- Design Science Research (DSR) — for artefact construction.
- Spec-Driven Development (SDD) — for the codebase.
- Keshav's three-pass method — for reading the literature.
- PRISMA 2020 — for the systematic-review structure.
- Wilcoxon signed-rank test — for non-parametric statistical validation.
- Marco Veritas protocol — disciplined AI-tool use in research: verify primary sources, never cite what cannot be checked.
- APA 7th edition — for citations.
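As an illustration of the Wilcoxon signed-rank test listed above, the sketch below compares two paired samples, for example the per-ticket absolute errors of two models on the same instances. It is a standard-library-only sketch rather than PUMA's statistical code: it drops zero differences, assigns average ranks to ties, and computes an exact two-sided p-value from the sign-permutation distribution. That choice is an assumption and suits small samples. For production analysis, `scipy.stats.wilcoxon` is the usual choice.

```python
def wilcoxon_signed_rank(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (W, p) where W = min(W+, W-). Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Average ranks of |d|, stored doubled so tied ranks stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        # Ranks i+1 .. j+1 average to (i + j + 2) / 2, so the doubled rank is i + j + 2.
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2
        i = j + 1

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    w_min2 = min(w_plus2, total2 - w_plus2)

    # counts[s] = number of sign assignments whose positive doubled-rank sum is s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    p = min(1.0, 2 * sum(counts[: w_min2 + 1]) / 2 ** n)
    return w_min2 / 2, p
```

With only five pairs the smallest attainable two-sided p-value is 0.0625, so a benchmark needs more paired instances than that before any difference can reach the conventional 0.05 threshold.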
Cost analysis
- Financial cost: zero. No API keys, no paywalls, no commercial licenses.
- Environmental cost: on the gpu-entry profile, a 200-instance triage run is on the order of 0.1–0.2 gCO₂-equivalent (CodeCarbon-measured; CPU-only profiles draw more energy and take longer).
- Hardware: 16 GB RAM minimum for the CPU-only profile; a GPU is optional and speeds runs up substantially.
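The environmental figure above can be turned into a rough per-workload estimate. The sketch below simply extrapolates the quoted 200-instance range linearly. That linearity is an assumption (model size, prompt length, and hardware all change the real number), and the function name is illustrative.

```python
# Figures quoted above for the gpu-entry profile.
RUN_INSTANCES = 200
RUN_GCO2E = (0.1, 0.2)  # order-of-magnitude range, in gCO2-equivalent

def projected_gco2e(instances: int):
    """Linear extrapolation; assumes the per-instance cost stays constant."""
    low, high = (g / RUN_INSTANCES * instances for g in RUN_GCO2E)
    return low, high

low, high = projected_gco2e(10_000)
print(f"10,000 instances: roughly {low:.0f} to {high:.0f} gCO2e")
```

Under that assumption a single triage instance costs about 0.0005 to 0.001 gCO₂-equivalent, so a 10,000-instance evaluation would land at roughly 5 to 10 g.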
Community
- Discord — discord.gg/fVhcpHREJv
- GitHub Discussions — on the puma-community repository.
- Contribute — start with the contributing guide.
- Report issues — open an issue on the relevant repository.
Resources
Code repositories
- PUMA benchmark tool — https://github.com/pumacp/puma
- PUMA Community — https://github.com/pumacp/puma-community
- PUMA Vault — https://github.com/pumacp/puma-vault
Documentation sites
- PUMA docs: https://pumacp.github.io/puma/
- Technical reference: docs/technical_reference.md (architecture, configuration, JSON Schema, ORM, CLI overview, glossary, decisions timeline)
- PUMA Community: https://pumacp.github.io/puma-community/
- PUMA Vault: https://pumacp.github.io/puma-vault/
- Wiki (tool): https://github.com/pumacp/puma/wiki
- Wiki (community): https://github.com/pumacp/puma-community/wiki
Hugging Face Hub
- Organization — https://huggingface.co/pumaproject
- Dataset of submissions — https://huggingface.co/datasets/pumaproject/puma-community-submissions
- Leaderboard (Gradio Space) — https://huggingface.co/spaces/pumaproject/puma-leaderboard
- Personal namespace — https://huggingface.co/pumacp
Persistent archives & catalogs
- Zenodo community (production) — https://zenodo.org/communities/pumacp
- Zenodo community (sandbox) — https://sandbox.zenodo.org/communities/pumacp
- Source dataset (Jira Social Repository) — https://doi.org/10.5281/zenodo.5901893
- Kaggle dataset — https://www.kaggle.com/datasets/pumacp/puma-community-submissions
Knowledge management & research
- Zotero library — https://www.zotero.org/pumacp/library
- Google Drive (PDF repository) — https://drive.google.com/drive/folders/1TKbYhYqLIrq7liAPISF7ztS2Bv0l7vZS?usp=sharing
- ResearchRabbit map 1 — https://app.researchrabbit.ai/folder-shares/d8244f17-47f7-4f6c-a589-473876578b54
- ResearchRabbit map 2 — https://app.researchrabbit.ai/folder-shares/b6c00471-2f28-4c66-85f5-ab5399470228
Conversation
- Discord — https://discord.gg/fVhcpHREJv
- Contact — pumacapstoneproject@gmail.com
Citation
If you use PUMA in your work, please cite it:
@software{puma_benchmark,
title = {PUMA: Local, reproducible benchmarking for LLMs in ICT Project Management},
author = {{The PUMA Project}},
year = {2026},
url = {https://github.com/pumacp/puma},
note = {Zenodo DOI forthcoming}
}
APA (7th edition):
The PUMA Project. (2026). PUMA: Local, reproducible benchmarking for LLMs in ICT Project Management [Computer software]. https://github.com/pumacp/puma
Note: A Zenodo DOI is forthcoming and will be appended here after the first DOI-backed snapshot.
PUMA is released under the MIT License. Built with MkDocs Material.