Home

Public submission hub for community-contributed local-LLM benchmark results in ICT Project Management.

Following empirical evidence, ICT project management faces triage, estimation, and learning inefficiencies.
Observed widely, these persist despite abundant historical data.
Laying a rigorous foundation requires reproducible benchmarking.
Leveraging labeled datasets enables systematic evaluation of LLM performance.
Outcomes are compared using quantitative metrics and statistical analysis.
With an incremental design, a minimal viable benchmark is defined.
Through open-source release, results become reproducible and verifiable.
Hence, the framework supports extensibility across models and tasks.
Eventually, it enables integration into real organizational settings.
Within ICT environments, recurring inefficiencies hinder effective decision-making.
Heterogeneous data sources complicate prioritization and estimation processes.
In response, this work builds a reproducible LLM-based benchmark.
The focus is on issue triage and story-point estimation tasks.
Evaluation follows controlled experiments with statistical validation.
Protocols ensure reproducibility through fixed parameters and configurations.
Using carbon tracking, the framework measures energy impact.
Moreover, the MVP delivers a valid and original contribution.
All artefacts are released as open source for replication and extension.

Submit your results · Browse the leaderboard · Read the schema

What is PUMA Community?

PUMA Community is the public, cryptographically verifiable archive of community-contributed benchmark results produced by the PUMA benchmark tool. Anyone can run PUMA on their own hardware, generate a submission, and publish it here for others to discover, cite, and reproduce.

The hub is serverless by design: all of its infrastructure runs on free services — GitHub Actions for validation and merge, Hugging Face Spaces for the leaderboard and verifier, and Zenodo for DOI-backed archival. Submissions are auto-validated against a JSON Schema, auto-merged when valid, and mirrored outward to external archives so downstream researchers and tool builders can find them where they already work.

Why a public submission hub?

Cryptographic integrity — every submission carries a deterministic SHA-256 hash over its predictions, recomputable and verifiable by anyone.
FAIR data — findable (Hugging Face mirror), accessible (CC-BY-4.0), interoperable (JSON Schema), reusable (open, forever).
Citable — DOI-backed Zenodo snapshots make every submission academically citable.
Reproducible — each submission records seed, temperature, model version, hardware profile, and sustainability cost.
Open — zero vendor lock-in, zero paid API dependencies, MIT-licensed.
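
To make the idea of a deterministic, recomputable hash concrete, here is a minimal sketch. It assumes the predictions are canonicalised as JSON with sorted keys, compact separators, and UTF-8 encoding; that canonical form and the function name `predictions_hash` are illustrative assumptions, and the PUMA client may serialise predictions differently.

```python
import hashlib
import json


def predictions_hash(predictions):
    """Return a SHA-256 hex digest over a canonical JSON encoding.

    Assumption: canonical form = sorted keys, compact separators, UTF-8.
    The real PUMA client may canonicalise differently.
    """
    canonical = json.dumps(
        predictions, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order inside a prediction does not affect the digest,
# but any change to a value does.
first = predictions_hash([{"id": "ISSUE-1", "label": "bug"}])
second = predictions_hash([{"label": "bug", "id": "ISSUE-1"}])
tampered = predictions_hash([{"id": "ISSUE-1", "label": "feature"}])
print(first == second, first == tampered)
```

Fixing the serialisation before hashing is what lets two independent parties arrive at the same digest from the same predictions.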

The submission pipeline

puma share-results  →  PR (submissions/<id>.json)  →  validate-submission CI
        │                                                      │
        │                                              valid?  ├─ no → "invalid" label + comment
        ▼                                                      ▼ yes
  local JSON package                                    "valid" label
                                                               │
                                          auto-merge-valid  →  main  →  update-badges
                                                               │
                              mirrors (HF / Zenodo / Kaggle) ──┤── notifiers (Discord / Telegram)
                                                               │
                                              verify-submission → <id>.verified.json

1. Run puma share-results --dry-run locally; this generates the submission JSON.
2. Open a pull request adding submissions/<id>.json.
3. The validate-submission workflow checks schema, filename, and integrity hash.
4. Valid PRs receive the valid label.
5. The auto-merge-valid workflow squash-merges them into main.
6. The update-badges workflow refreshes the live counters.
7. Mirror workflows (when secrets are configured) propagate to Hugging Face, Zenodo, and Kaggle.
8. Notify workflows (when secrets are configured) announce to Discord and Telegram.
9. The verify-submission sidecar computes an independent verification badge.

Submission format

Each submission is a single JSON document conforming to schema v1.0.0:

Identification — submission_id (UUIDv4), schema_version, puma_version.
Submitter consent — explicit CC-BY-4.0 release flags.
Run metadata — scenario, model, strategy, seed, temperature.
Hardware profile — a canonical profile_id from the PUMA catalog.
Metrics — F1-macro, MAE, accuracy.
Sustainability — CodeCarbon-measured emissions.
Integrity — predictions_summary_hash.

See the submission format guide for the full field-by-field tour.
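
As a rough illustration of how those groups might sit together in one document, the sketch below builds a placeholder submission. Only `submission_id`, `schema_version`, `puma_version`, `profile_id`, `predictions_summary_hash`, and the run-metadata names come from the list above; the nesting, the consent, metric, and emissions key names, and every value are assumptions. The schema itself remains the authority.

```python
import json
import uuid

submission_id = str(uuid.uuid4())  # UUIDv4, as the format requires

submission = {
    # Identification
    "submission_id": submission_id,
    "schema_version": "1.0.0",
    "puma_version": "<puma-version>",
    # Submitter consent (hypothetical key name)
    "consent": {"cc_by_4_0": True},
    # Run metadata (grouping under "run" is an assumption)
    "run": {
        "scenario": "<scenario>",
        "model": "<model>",
        "strategy": "<strategy>",
        "seed": 0,
        "temperature": 0.0,
    },
    # Hardware profile
    "hardware": {"profile_id": "<profile-id>"},
    # Metrics and sustainability (hypothetical key names, left unset here)
    "metrics": {"f1_macro": None, "mae": None, "accuracy": None},
    "sustainability": {"emissions_kg_co2eq": None},
    # Integrity
    "predictions_summary_hash": "<sha256-hex>",
}

# The file name is derived from the submission_id field.
filename = f"submissions/{submission_id}.json"
text = json.dumps(submission, indent=2)
print(filename)
```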

Validation guarantees

Every merged submission satisfies three guarantees:

Schema conformance — validates against schema/submission.v1.json (JSON Schema Draft 2020-12).
Filename consistency — the file name must match the submission_id field.
Integrity — predictions_summary_hash is recomputed server-side and compared to the declared value.

PRs that fail any guarantee receive the invalid label with a sticky comment summarizing the failure.
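
The filename and integrity checks can be sketched with the standard library alone. This is not the validate-submission workflow itself: the function names are hypothetical, the hash recomputation is passed in as a stand-in because its exact algorithm is not described here, and schema conformance is left out because it needs a JSON Schema validator.

```python
from pathlib import PurePosixPath


def check_submission(path, document, recompute_hash):
    """Return a list of failure messages; an empty list means the PR is valid.

    `recompute_hash` is a caller-supplied function standing in for the
    server-side hashing step (an assumption, not the real implementation).
    """
    failures = []

    # Filename consistency: the file name must match the submission_id field.
    expected_name = f"{document['submission_id']}.json"
    if PurePosixPath(path).name != expected_name:
        failures.append(f"filename should be {expected_name}")

    # Integrity: compare the recomputed hash with the declared value.
    if recompute_hash(document) != document["predictions_summary_hash"]:
        failures.append("predictions_summary_hash does not match the recomputed value")

    return failures


def label_for(failures):
    """Map the check result onto the PR label named in the text."""
    return "invalid" if failures else "valid"


doc = {"submission_id": "abc", "predictions_summary_hash": "declared"}
problems = check_submission("submissions/xyz.json", doc, lambda d: "recomputed")
print(label_for(problems), problems)
```

Collecting every failure before labelling matches the idea of a single comment that summarises what went wrong.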

The mirror network

Channel	Target	Status
Hugging Face Datasets	`pumaproject/puma-community-submissions`	mirror active when its secret is configured
Zenodo community	`pumacp`	sandbox validated; production pending the first DOI
Kaggle dataset	`pumacp/puma-community-submissions`	prepared, dormant — activated by trigger

Each mirror has its own GitHub Actions workflow under .github/workflows/, runs on its own schedule, and is gated by the secret it requires.
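
The secret-gating logic amounts to "run a mirror only if its credential is present". The sketch below shows that decision in Python for illustration only; the real gating lives in the GitHub Actions workflows, and the secret names used here are hypothetical.

```python
import os

# Hypothetical secret names; the real workflow secrets are not listed on this page.
MIRROR_SECRETS = {
    "huggingface": "HF_TOKEN",
    "zenodo": "ZENODO_TOKEN",
    "kaggle": "KAGGLE_KEY",
}


def active_mirrors(env):
    """Return the mirrors whose secret is set to a non-empty value in `env`."""
    return [name for name, secret in MIRROR_SECRETS.items() if env.get(secret)]


# With no secrets configured, every mirror stays dormant.
print(active_mirrors(os.environ))
```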

The verifier pipeline

Verification is independent of the original submitter:

An independent verifier reimplements the PUMA client's hashing algorithm so that it produces byte-identical output.
The verify-submission workflow detects new submissions via git diff and invokes the verifier.
Each submission gets a sidecar <id>.verified.json next to it.
Verification status renders as a badge in the leaderboard.

Trust model: cryptographic hashing makes tampering with the declared predictions detectable, and the verifier runs independently of the submitter, so the integrity of a published result can be checked without relying on the person who submitted it.
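
A minimal sketch of the sidecar step follows, assuming the list of changed paths has already been obtained (for example from `git diff --name-only`). The function names and the sidecar fields are hypothetical; only the `submissions/<id>.json` and `<id>.verified.json` naming comes from the text above.

```python
from pathlib import PurePosixPath

SUBMISSIONS_DIR = PurePosixPath("submissions")


def new_submissions(changed_paths):
    """Keep only submission files, skipping sidecars and unrelated paths."""
    found = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        is_submission = (
            path.parent == SUBMISSIONS_DIR
            and path.suffix == ".json"
            and not path.name.endswith(".verified.json")
        )
        if is_submission:
            found.append(path)
    return found


def sidecar_path(submission_path):
    """Place <id>.verified.json next to <id>.json."""
    return submission_path.with_name(submission_path.stem + ".verified.json")


def build_sidecar(submission_id, declared_hash, recomputed_hash):
    """Hypothetical sidecar payload recording the verification outcome."""
    return {
        "submission_id": submission_id,
        "recomputed_hash": recomputed_hash,
        "verified": declared_hash == recomputed_hash,
    }


changed = ["submissions/abc.json", "submissions/abc.verified.json", "README.md"]
for path in new_submissions(changed):
    print(path, "->", sidecar_path(path))
```

Skipping files that already end in `.verified.json` keeps the verifier from treating its own output as a new submission.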

How to contribute

# 1. Run PUMA locally and generate a submission
puma run specs/runs/baseline_triage.yaml
puma share-results

# 2. Fork puma-community and create a branch
gh repo fork pumacp/puma-community --clone
cd puma-community && git checkout -b my-submission

# 3. Add the submission JSON
cp ~/.puma/submissions/<id>.json submissions/<id>.json

# 4. Validate locally (optional but recommended)
python -m jsonschema -i submissions/<id>.json schema/submission.v1.json

# 5. Open the PR
git add submissions/<id>.json && git commit -m "Add submission <id>"
git push origin my-submission && gh pr create --fill

See the contributing guide for the long-form walkthrough.

The community

Discord — discord.gg/fVhcpHREJv
GitHub Discussions — on this repository.
Contribute — start with the contributing guide.
Report issues — open an issue on the relevant repository.

Trust model & Code of Conduct

All submissions are released under CC-BY-4.0 with attribution.
The project enforces the Contributor Covenant v2.1.
Personal-data scanning runs client-side in puma share-results, before a submission payload is ever constructed.
CI validation is intentionally narrow (schema, filename, and hash) and does not repeat the personal-data scan, so the recommended client path remains the trusted source.

Roadmap

The hub grows along trigger-based horizons rather than fixed dates:

Horizon	Milestone	Trigger	Status
H1	Hub live, CI green, docs published	Public launch	complete
H2	First external community submissions	Outside contributors open PRs	pending external submissions
H3	DOI-backed snapshots	First Zenodo production deposit	planned
H4	Mirror activation (HF / Zenodo / Kaggle)	Secrets configured	designed
H5	Notifications (Discord / Telegram)	Webhook/bot secrets configured	designed
H6	Verifier at scale	Sustained submission volume	designed

Resources

Conversation

Discord — https://discord.gg/fVhcpHREJv
Contact — pumacapstoneproject@gmail.com

Citation

If you use PUMA Community submissions as a data source, please cite the archive:

@misc{puma_community,
  title        = {PUMA Community: a public archive of community-contributed LLM benchmark results for ICT Project Management},
  author       = {{The PUMA Project}},
  year         = {2026},
  howpublished = {\url{https://github.com/pumacp/puma-community}},
  note         = {Zenodo DOI forthcoming}
}

Note

A Zenodo DOI is forthcoming and will be appended here after the first DOI-backed snapshot.

PUMA Community is released under the MIT License. Built with MkDocs Material. See also the PUMA benchmark tool docs.