Home

Local, reproducible, sustainable benchmarking for LLMs in ICT Project Management.

PUMA Platform
PUMA · PUMA Community · PUMA Vault

PUMA Info
YouTube · PUMA Wiki · PUMA Community Wiki · NotebookLM · Drive (info)

PUMA Contact
Reddit · Discord · GitHub Discussions · Twitter/X

PUMA Project

**F**ollowing empirical evidence, ICT project management faces triage, estimation, and learning inefficiencies.
**O**bserved widely, these persist despite abundant historical data.
**L**aying a rigorous foundation requires reproducible benchmarking.
**L**everaging labeled datasets enables systematic evaluation of LLM performance.
**O**utcomes are compared using quantitative metrics and statistical analysis.
**W**ith an incremental design, a minimal viable benchmark is defined.
**T**hrough open-source release, results become reproducible and verifiable.
**H**ence, the framework supports extensibility across models and tasks.
**E**ventually, it enables integration into real organizational settings.

**W**ithin ICT environments, recurring inefficiencies hinder effective decision-making.
**H**eterogeneous data sources complicate prioritization and estimation processes.
**I**n response, this work builds a reproducible LLM-based benchmark.
**T**he focus is on issue triage and story-point estimation tasks.
**E**valuation follows controlled experiments with statistical validation.
**P**rotocols ensure reproducibility through fixed parameters and configurations.
**U**sing carbon tracking, the framework measures energy impact.
**M**oreover, the MVP delivers a valid and original contribution.
**A**ll artefacts are released as open source for replication and extension.

PUMA Community
HF Organization · HF Submissions · HF Leaderboard · Zenodo · Kaggle · Zotero

PUMA Code
PUMA Project · PUMA Community · PUMA Vault

Get started · Cite PUMA · Join the community

What is PUMA?

PUMA is a local-first, open-source benchmarking platform that evaluates open-weight large language models (LLMs) on two ICT Project Management tasks: issue triage (classifying a ticket's priority) and story-point estimation (predicting effort). It runs entirely on your own hardware through Ollama: no API calls, no accounts, and no data ever leaves your machine. Every run is deterministic, so the same inputs produce byte-identical predictions. Every run is also sustainability-aware: CodeCarbon measures its energy use and carbon emissions.

PUMA was built to test a hypothesis: that AI systems can be evaluated rigorously, and that the evaluation itself can be reproducible, auditable, and free. Around the tool sits PUMA Community, a public archive where any researcher or practitioner can publish their results with cryptographic integrity — turning a private benchmark into a shared, verifiable body of evidence.

The questions PUMA addresses

PUMA began as a set of open questions about research, software, and AI:

Can rigorous academic research be conducted using AI tools?
Can scientific studies conducted with AI tools be scientifically replicated?
What is the state of the art in software development built with AI tools?
Can software built with AI tools be efficiently audited?
Can a local, free, automated project-management evaluation platform be built?
Is there a paradigm shift underway in research, software development, and project management?
What is the cost — financial and environmental — of adopting LLMs in business or project-management processes?
Can LLM capabilities on a concrete scenario be measured scientifically, alongside their environmental consequences?
Can gaps in the literature be detected using AI tools — can AI surface new problems and new solutions?
Can a research project be documented in a live, automated, reproducible, real-time way?
Can the resulting knowledge be shared publicly, without authorship, investment, or time restrictions?
Can AI tools be studied using AI tools — can the object of study also be an instrument of its own study?

PUMA is one concrete, working answer to these questions: a reproducible artefact whose construction and evidence are public end-to-end.

The two phases of PUMA's development

PUMA was built in two openly documented phases:

1. Research. A structured literature review (Keshav's three-pass method for reading, PRISMA 2020 for the review structure), methodology design (Design Science Research), hypothesis formulation, and experimental design.

2. Artefact construction. Two artefacts were built: the PUMA benchmark tool (this repository) and PUMA Community (the public submission hub).

Both phases are documented publicly so that each can be reproduced.

How PUMA works

PUMA is organized as a six-layer architecture, each layer with a single responsibility:

| Layer | Responsibility |
| --- | --- |
| Orchestrator | Drives a run-spec end to end (load → infer → evaluate → persist). |
| Runtime | Talks to Ollama locally, with deterministic, bounded retry. |
| Models | Discovers and describes the local model catalog (read-only). |
| Evaluation | Computes task metrics (F1-macro, MAE, calibration, latency). |
| Diagnostics | `puma doctor` / `puma env`: health and environment checks. |
| Community | Builds, validates, and publishes submissions with integrity hashes. |
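The Runtime layer's "deterministic, bounded retry" can be pictured with a small sketch. This is not PUMA's code: the function name `call_with_bounded_retry`, the delay schedule, and the choice of `ConnectionError` as the retried exception are all assumptions made for illustration. The point is only that a fixed schedule with no random jitter behaves the same way on every run and always stops after a known number of attempts.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_bounded_retry(
    fn: Callable[[], T],
    delays: Sequence[float] = (0.5, 1.0, 2.0),  # hypothetical schedule, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with a fixed delay schedule.

    There is no random jitter, so the timing is identical on every run,
    and the total number of attempts is bounded by len(delays) + 1.
    """
    last_error = None
    for attempt in range(len(delays) + 1):
        try:
            return fn()
        except ConnectionError as exc:
            last_error = exc
            # Sleep only if another attempt is still allowed.
            if attempt < len(delays):
                sleep(delays[attempt])
    raise RuntimeError(f"gave up after {len(delays) + 1} attempts") from last_error
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at the default.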

Results are stored in a bi-temporal SQLite database; a Streamlit dashboard visualizes runs; and CodeCarbon records energy and emissions on every run. Crucially, execution is local-only — PUMA never makes an external inference call.
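To make "bi-temporal" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and helper functions are hypothetical and are not PUMA's actual schema. The sketch only illustrates the general idea: each row carries a valid time (when the fact was true) and a transaction time (when the database recorded it), and corrections are appended rather than overwritten, so you can ask what the database believed at any past moment.

```python
import sqlite3

# Hypothetical schema for illustration only; PUMA's real tables may differ.
SCHEMA = """
CREATE TABLE metric_versions (
    run_id        TEXT NOT NULL,
    metric        TEXT NOT NULL,
    value         REAL NOT NULL,
    valid_at      TEXT NOT NULL,  -- valid time: when the run took place
    recorded_at   TEXT NOT NULL,  -- transaction time: when this row was written
    superseded_at TEXT            -- transaction time: when a later row replaced it
);
"""


def record(conn, run_id, metric, value, valid_at, now):
    """Append a new version and close the previous one; nothing is overwritten."""
    conn.execute(
        "UPDATE metric_versions SET superseded_at = ? "
        "WHERE run_id = ? AND metric = ? AND superseded_at IS NULL",
        (now, run_id, metric),
    )
    conn.execute(
        "INSERT INTO metric_versions VALUES (?, ?, ?, ?, ?, NULL)",
        (run_id, metric, value, valid_at, now),
    )


def value_as_of(conn, run_id, metric, as_of):
    """Return the value the database held at transaction time as_of, or None."""
    row = conn.execute(
        "SELECT value FROM metric_versions "
        "WHERE run_id = ? AND metric = ? AND recorded_at <= ? "
        "AND (superseded_at IS NULL OR superseded_at > ?)",
        (run_id, metric, as_of, as_of),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative values, not measured results.
record(conn, "run-1", "f1_macro", 0.71, "2026-01-10T09:00:00", "2026-01-10T09:05:00")
record(conn, "run-1", "f1_macro", 0.73, "2026-01-10T09:00:00", "2026-01-12T14:00:00")
```

With these two illustrative rows, `value_as_of(conn, "run-1", "f1_macro", "2026-01-11T00:00:00")` returns the original value and a later `as_of` returns the corrected one. ISO 8601 timestamps are used because they sort correctly as plain strings.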

What PUMA measures

F1-macro — issue triage (multi-class classification).
MAE — story-point estimation (numeric prediction).
Per-class precision / recall / F1 — where the model succeeds or fails.
Calibration (ECE) — when log-probabilities are available.
Latency — p50, p95, p99 per inference.
Sustainability — grams of CO₂-equivalent and kWh, via CodeCarbon.
Reproducibility — byte-identical predictions across repeated runs.
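For readers who want to see what the headline numbers mean, the sketch below computes F1-macro, MAE, and a latency percentile from scratch. It is a plain-Python illustration, not PUMA's Evaluation layer: the function names are made up here, the percentile uses the nearest-rank definition (PUMA may use a different interpolation), and calibration is left out.

```python
import math
from typing import Sequence


def f1_macro(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between true and predicted story points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]
```

As a small worked case, true labels `["high", "high", "low", "medium"]` against predictions `["high", "low", "low", "medium"]` give per-class F1 scores of 2/3, 2/3, and 1, so F1-macro is 7/9 (about 0.778). Because every class counts equally, a rare priority class that the model misses pulls the score down as much as a common one.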

Why use PUMA?

100% local — code and data never leave your machine.
Reproducible — same inputs produce the same outputs, byte for byte.
Sustainable — measures and reports its own environmental impact.
Free — no API keys, no paywalls, no commercial licenses.
Open — MIT-licensed and community-contributed.
Multi-model — Qwen, Mistral, Llama, and Gemma families.
Multi-hardware — CPU-only or GPU profiles, auto-detected.
Verifiable — cryptographic integrity on every submission.
Statistically rigorous — Wilcoxon validation and falsifiable hypotheses.

Quick start

# 1. Install — via pip (available with the v4.0.0 release)
pip install puma-cp

# ...or via Docker (available with the v4.0.0 release)
docker pull ghcr.io/pumacp/puma:latest
curl -O https://raw.githubusercontent.com/pumacp/puma/main/docker-compose.yml && docker compose up

# ...or from source (works today)
git clone https://github.com/pumacp/puma && cd puma
pip install -e ".[dev]"

# 2. Check your environment (Ollama, models, hardware, database)
puma doctor

# 3. See which models are available locally
puma models list

# 4. Run your first benchmark
puma run specs/runs/baseline_triage.yaml

# 5. Package the result for sharing (local dry-run, no network)
puma share-results --dry-run --run-id <run_id> --yes

1. Install the package and its development extras.
2. `puma doctor` confirms Ollama is reachable, a model is present, and the hardware profile is detected.
3. `puma models list` shows the models pulled locally.
4. `puma run` executes a run-spec and prints `Run complete: <run_id>`.
5. `puma share-results --dry-run` builds a submission package on disk without touching the network.
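If you script these steps, one reasonable pattern is to start a benchmark only when the environment check passes. The sketch below is a suggestion, not part of PUMA: `run_cli` and `doctor_then_run` are hypothetical helper names, and the only CLI behavior relied on is the two commands shown above plus the usual convention that a zero exit code means success.

```python
import subprocess
from typing import Callable, Sequence


def run_cli(args: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(args), check=False).returncode


def doctor_then_run(
    spec_path: str,
    runner: Callable[[Sequence[str]], int] = run_cli,
) -> int:
    """Run `puma doctor` first; start the benchmark only if it exits with 0."""
    doctor_code = runner(["puma", "doctor"])
    if doctor_code != 0:
        # Environment problem: skip the benchmark and surface the exit code.
        return doctor_code
    return runner(["puma", "run", spec_path])
```

Calling `doctor_then_run("specs/runs/baseline_triage.yaml")` would then mirror steps 2 and 4. The `runner` parameter exists so the logic can be exercised without PUMA installed.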

Practical tutorials

1 · Baseline triage

puma run specs/runs/baseline_triage.yaml

Runs the canonical triage benchmark (200 instances). The summary prints F1-macro and the run id you can feed into later commands.

2 · Compare two models

puma run specs/runs/baseline_triage.yaml        # model A (edit the spec's model)
puma run specs/runs/baseline_triage.yaml        # model B (edit the spec's model again first)
puma compare <run_id_a> <run_id_b>

Lays the two runs' metrics side by side so you can see which model wins on the same task and data.

3 · Reproduce a submission

puma community pull
puma community verify-hash <submission>.json --predictions <submission>.predictions.jsonl

Downloads community submissions and re-derives the integrity hash locally to confirm a published result is exactly what it claims to be.

4 · Check hardware

puma doctor

A read-only health sweep: Python, CodeCarbon, Ollama, models, hardware profile, database, and baseline specs. Exits non-zero if anything is wrong.

5 · List models

puma models list
puma models recommended

`list` shows what is pulled locally; `recommended` shows the curated catalog and which entries you still need to `ollama pull`.

6 · Generate a submission

puma share-results --dry-run --run-id <run_id> --yes

Produces a submission.json + predictions.jsonl package locally for review before any external publication.

7 · Verify integrity

puma community verify-hash submission.json --predictions predictions.jsonl

Recomputes the predictions hash and compares it to the declared value: exit 0 means verified, exit 1 means the file does not match.

8 · Compare two models on the same task

puma run specs/runs/baseline_triage.yaml   # run each model you want to compare
puma dashboard                             # open http://localhost:8501

In the dashboard, pick the Multi-model view, choose a scenario, and select two or more models. You get headline metrics with deltas (F1-macro for triage, MAE for estimation), bar charts for F1-macro / MAE / p95 latency / carbon, a full metrics table, and a reproducibility check on each model's prediction fingerprint. Everything reads from persisted results, with no live inference.

Use cases

Academic research — rigorously evaluate model capabilities for PMO tasks and publish the findings.
Pre-production evaluation — test which open-weight LLM fits your use case before committing to it.
Model comparison — benchmark several models on the same task with statistical validation.
Sustainability auditing — measure the environmental cost of AI-enabled workflows.
Reproducibility verification — independently confirm a published result.

PUMA Community

PUMA Community is the public archive of community-contributed benchmark results.

What it is — a public, governance-first repository of submissions.
How it works — you open a pull request with a submission JSON; CI validates it against the schema and integrity hash, auto-merges if valid, and mirrors it to Hugging Face, Zenodo, and Kaggle.
Why it matters — submissions are cryptographically verifiable, follow FAIR data principles, and receive citable DOIs.
How to contribute — see the contributing guide.
Status — browse the public leaderboard.
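The integrity check behind a submission can be sketched in a few lines. This page does not say which hash algorithm or which bytes PUMA digests, so the sketch assumes SHA-256 over the raw contents of the predictions file; the function names are hypothetical. It shows the general recompute-and-compare step, returning 0 for a match and 1 for a mismatch in the style of a command-line exit code.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Hex digest of the given bytes (assumed algorithm: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def verify_predictions(declared_hash: str, predictions: bytes) -> int:
    """Return 0 if the recomputed digest matches the declared one, else 1."""
    actual = sha256_of_bytes(predictions)
    return 0 if actual == declared_hash.strip().lower() else 1
```

Because the digest covers every byte, a single changed character in the predictions, including a trailing newline, produces a different hash and a failed check. That sensitivity is what lets a third party confirm a published result without trusting the publisher.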

Research with PUMA Vault

PUMA Vault is the public knowledge graph behind the research process. It is built with Obsidian using PARA, GTD, and Zettelkasten methods, linking literature notes, methodology decisions, and findings into a navigable web. Browse it at pumacp.github.io/puma-vault.

Methodologies

Design Science Research (DSR) — for artefact construction.
Spec-Driven Development (SDD) — for the codebase.
Keshav's three-pass method — for reading the literature.
PRISMA 2020 — for the systematic-review structure.
Wilcoxon signed-rank test — for non-parametric statistical validation.
Marco Veritas protocol — disciplined AI-tool use in research: verify primary sources, never cite what cannot be checked.
APA 7th edition — for citations.
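To show what the Wilcoxon signed-rank test does with paired scores (for example, two models evaluated on the same instances), here is a self-contained sketch. It is an illustration under stated simplifications, not PUMA's statistics code: it uses the normal approximation with no tie or continuity correction, which is rough for small samples. In practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon signed-rank test using a normal approximation.

    Returns (W, p), where W is the smaller of the positive and negative rank sums.
    Zero differences are dropped; tied absolute differences share an average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of equal absolute differences starting at i.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return w, p
```

The test ranks the sizes of the paired differences and asks whether positive and negative differences are balanced, so it does not assume the differences are normally distributed. That is why it suits benchmark scores, which are often skewed or bounded.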

Cost analysis

Financial cost — zero: no API keys, no paywalls, no commercial licenses.
Environmental cost — on the gpu-entry profile, a 200-instance triage run is on the order of 0.1–0.2 gCO₂-equivalent (CodeCarbon-measured; CPU-only profiles draw more energy and take longer).
Hardware — 16 GB RAM minimum for the CPU-only profile; a GPU is optional and speeds runs up substantially.
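The environmental figure is easier to interpret once you see the arithmetic behind it: emissions are energy consumed multiplied by the carbon intensity of the electricity used. The sketch below shows that relationship in both directions. The function names are illustrative, and any grid intensity you plug in is your own assumption, since the real value depends on where and when the run happens.

```python
def emissions_g(energy_kwh: float, grid_intensity_g_per_kwh: float) -> float:
    """Emissions in grams of CO2-equivalent for a given energy use."""
    return energy_kwh * grid_intensity_g_per_kwh


def energy_kwh_from_emissions(emissions_g_co2e: float, grid_intensity_g_per_kwh: float) -> float:
    """Invert the relationship: energy implied by an emissions figure."""
    if grid_intensity_g_per_kwh <= 0:
        raise ValueError("grid intensity must be positive")
    return emissions_g_co2e / grid_intensity_g_per_kwh
```

For example, under an assumed intensity of 400 gCO₂e/kWh (a placeholder, not a measured value), a run that emits 0.2 g corresponds to 0.0005 kWh, or half a watt-hour. The same run on a cleaner or dirtier grid would report a different emissions figure for identical energy use, which is worth keeping in mind when comparing submissions from different regions.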

Community

Discord — discord.gg/fVhcpHREJv
GitHub Discussions — on the puma-community repository.
Contribute — start with the contributing guide.
Report issues — open an issue on the relevant repository.

Resources

Code repositories

PUMA benchmark tool — https://github.com/pumacp/puma
PUMA Community — https://github.com/pumacp/puma-community
PUMA Vault — https://github.com/pumacp/puma-vault

Documentation sites

PUMA docs — https://pumacp.github.io/puma/
Technical reference — docs/technical_reference.md (architecture, configuration, JSON Schema, ORM, CLI overview, glossary, decisions timeline)
PUMA Community — https://pumacp.github.io/puma-community/
PUMA Vault — https://pumacp.github.io/puma-vault/
Wiki (tool) — https://github.com/pumacp/puma/wiki · Wiki (community) — https://github.com/pumacp/puma-community/wiki

Conversation

Discord — https://discord.gg/fVhcpHREJv
Contact — pumacapstoneproject@gmail.com

Citation

If you use PUMA in your work, please cite it:

@software{puma_benchmark,
  title        = {PUMA: Local, reproducible benchmarking for LLMs in ICT Project Management},
  author       = {{The PUMA Project}},
  year         = {2026},
  url          = {https://github.com/pumacp/puma},
  note         = {Zenodo DOI forthcoming}
}

APA (7th edition):

The PUMA Project. (2026). PUMA: Local, reproducible benchmarking for LLMs in ICT Project Management [Computer software]. https://github.com/pumacp/puma

Note

A Zenodo DOI is forthcoming and will be appended here after the first DOI-backed snapshot.

PUMA is released under the MIT License. Built with MkDocs Material.