


Public submission hub for community-contributed local-LLM benchmark results in ICT Project Management.



Following empirical evidence, ICT project management faces triage, estimation, and learning inefficiencies.
Observed widely, these persist despite abundant historical data.
Laying a rigorous foundation requires reproducible benchmarking.
Leveraging labeled datasets enables systematic evaluation of LLM performance.
Outcomes are compared using quantitative metrics and statistical analysis.
With an incremental design, a minimal viable benchmark is defined.
Through open-source release, results become reproducible and verifiable.
Hence, the framework supports extensibility across models and tasks.
Eventually, it aims to enable integration into real organizational settings.
Within ICT environments, recurring inefficiencies hinder effective decision-making.
Heterogeneous data sources complicate prioritization and estimation processes.
In response, this work builds a reproducible LLM-based benchmark.
The focus is on issue triage and story-point estimation tasks.
Evaluation follows controlled experiments with statistical validation.
Protocols ensure reproducibility through fixed parameters and configurations.
Using carbon tracking, the framework measures energy impact.
Moreover, the MVP delivers a valid and original contribution.
All artefacts are released as open source for replication and extension.



What is PUMA Community?

PUMA Community is the public, cryptographically verifiable archive of community-contributed benchmark results produced by the PUMA benchmark tool. Anyone can run PUMA on their own hardware, generate a submission, and publish it here for others to discover, cite, and reproduce.

The hub is serverless by design: all of its infrastructure runs on free services (GitHub Actions for validation and merge, Hugging Face Spaces for the leaderboard and verifier, and Zenodo for DOI-backed archival). Submissions are auto-validated against a JSON Schema, auto-merged when valid, and, once the corresponding secrets are configured, mirrored to external archives so downstream researchers and tool builders can find them where they already work.

Why a public submission hub?

  • Cryptographic integrity: every submission carries a deterministic SHA-256 hash over its predictions, which anyone can recompute and verify.
  • FAIR data: findable (Hugging Face mirror), accessible (public GitHub repository), interoperable (JSON Schema), reusable (CC-BY-4.0 license).
  • Citable: DOI-backed Zenodo snapshots make every submission academically citable.
  • Reproducible: each submission records seed, temperature, model version, hardware profile, and sustainability cost.
  • Open: no vendor lock-in, no paid API dependencies, MIT-licensed.
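A minimal sketch of how such a deterministic hash can be computed is shown below. It assumes, for illustration only, that predictions are serialized as canonical JSON (sorted keys, compact separators, UTF-8) before SHA-256 is applied; the PUMA client's actual canonicalization is not described on this page, and the helper names `canonical_bytes` and `predictions_summary_hash` are hypothetical.

```python
import hashlib
import json
from typing import Any


def canonical_bytes(obj: Any) -> bytes:
    """Assumed canonical form: sorted keys, compact separators, UTF-8.

    Identical data always serializes to identical bytes, regardless of
    dictionary insertion order.
    """
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def predictions_summary_hash(predictions: list[dict[str, Any]]) -> str:
    """Hypothetical helper: hex-encoded SHA-256 over the canonical predictions."""
    return hashlib.sha256(canonical_bytes(predictions)).hexdigest()


if __name__ == "__main__":
    a = [{"issue_id": "X-1", "predicted": "high"}]
    b = [{"predicted": "high", "issue_id": "X-1"}]  # same data, different key order
    print(predictions_summary_hash(a) == predictions_summary_hash(b))  # True
```

Sorting keys and fixing separators makes the digest depend only on the data, not on dictionary order or whitespace; an independent verifier has to reproduce exactly the same serialization to obtain a byte-identical hash.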

The submission pipeline

puma share-results  →  PR (submissions/<id>.json)  →  validate-submission CI
        │                                                      │
        │                                              valid?  ├─ no → "invalid" label + comment
        ▼                                                      ▼ yes
  local JSON package                                    "valid" label
                                          auto-merge-valid  →  main  →  update-badges
                              mirrors (HF / Zenodo / Kaggle) ──┤── notifiers (Discord / Telegram)
                                              verify-submission → <id>.verified.json

  1. Run puma share-results --dry-run locally; this generates the submission JSON.
  2. Open a pull request adding submissions/<id>.json.
  3. The validate-submission workflow checks schema, filename, and integrity hash.
  4. Valid PRs receive the valid label.
  5. The auto-merge-valid workflow squash-merges them into main.
  6. The update-badges workflow refreshes the live counters.
  7. Mirror workflows (when secrets are configured) propagate to Hugging Face, Zenodo, and Kaggle.
  8. Notify workflows (when secrets are configured) announce to Discord and Telegram.
  9. The verify-submission sidecar computes an independent verification badge.

Submission format

Each submission is a single JSON document conforming to schema v1.0.0:

  • Identification: submission_id (UUIDv4), schema_version, puma_version.
  • Submitter consent: explicit CC-BY-4.0 release flags.
  • Run metadata: scenario, model, strategy, seed, temperature.
  • Hardware profile: a canonical profile_id from the PUMA catalog.
  • Metrics: F1-macro, MAE, accuracy.
  • Sustainability: CodeCarbon-measured emissions.
  • Integrity: predictions_summary_hash.
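To make the field list concrete, here is a hypothetical skeleton of a submission document. Only the field names listed above come from this page; the grouping keys (`consent`, `run`, `hardware`, `metrics`, `sustainability`), the metric key spellings, and every value are placeholders, so this is not a schema-valid submission and should not be submitted as-is.

```python
import json
import uuid
from typing import Any


def example_submission() -> dict[str, Any]:
    """Hypothetical skeleton: grouping keys and all values are placeholders."""
    return {
        # Identification (field names taken from the format overview)
        "submission_id": str(uuid.uuid4()),
        "schema_version": "1.0.0",
        "puma_version": "<puma-version>",  # placeholder
        # Submitter consent (hypothetical key names)
        "consent": {"license": "CC-BY-4.0", "release": True},
        # Run metadata (all values are placeholders)
        "run": {
            "scenario": "<scenario>",
            "model": "<model>",
            "strategy": "<strategy>",
            "seed": 0,
            "temperature": 0.0,
        },
        # Hardware profile: a canonical profile_id from the PUMA catalog
        "hardware": {"profile_id": "<profile-id>"},
        # Metrics and sustainability left empty: no results are implied
        "metrics": {"f1_macro": None, "mae": None, "accuracy": None},
        "sustainability": {"emissions": None},
        # Integrity: placeholder for the declared hash
        "predictions_summary_hash": "<declared-hash>",
    }


if __name__ == "__main__":
    doc = example_submission()
    # str(uuid.uuid4()) yields the canonical hyphenated UUIDv4 form.
    assert uuid.UUID(doc["submission_id"]).version == 4
    print(json.dumps(doc, indent=2))
```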

See the submission format guide for the full field-by-field tour.

Validation guarantees

Every merged submission satisfies three guarantees:

  1. Schema conformance: validates against schema/submission.v1.json (JSON Schema Draft 2020-12).
  2. Filename consistency: the file name must match the submission_id field.
  3. Integrity: predictions_summary_hash is recomputed server-side and compared to the declared value.

PRs that fail any guarantee receive the invalid label with a sticky comment summarizing the failure.
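The filename and integrity guarantees can be sketched as two small checks plus a label decision. This sketch assumes a lowercase, hyphenated UUIDv4 `submission_id`, a canonical-JSON SHA-256 scheme (sorted keys, compact separators; an assumption, since the real PUMA algorithm may differ), and that the predictions to hash are supplied separately. Schema conformance is left to a JSON Schema validator such as the `jsonschema` CLI used in the contribution steps. All function names are hypothetical.

```python
import hashlib
import json
import re
from pathlib import PurePosixPath
from typing import Any

# Assumption: lowercase, hyphenated UUIDv4 (the form str(uuid.uuid4()) produces).
UUID4_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)


def canonical_sha256(obj: Any) -> str:
    """Assumed scheme: SHA-256 over sorted-key, compact, UTF-8 JSON."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_submission(path: str, doc: dict[str, Any], predictions: Any) -> list[str]:
    """Return failure messages for the filename and integrity guarantees.

    An empty list means both checks passed; schema conformance is assumed
    to be checked separately by a JSON Schema validator.
    """
    failures: list[str] = []
    sid = str(doc.get("submission_id", ""))
    if not UUID4_RE.match(sid):
        failures.append(f"submission_id {sid!r} is not a UUIDv4")
    name = PurePosixPath(path).name
    if name != f"{sid}.json":
        failures.append(f"file name {name!r} does not match submission_id")
    if doc.get("predictions_summary_hash") != canonical_sha256(predictions):
        failures.append("predictions_summary_hash differs from the recomputed value")
    return failures


def label_for(failures: list[str]) -> str:
    """Map check results to the PR label."""
    return "valid" if not failures else "invalid"
```

In a CI job, the returned messages would be the natural content of the sticky comment whenever `label_for` returns `invalid`.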

The mirror network

| Channel | Target | Status |
|---|---|---|
| Hugging Face Datasets | pumaproject/puma-community-submissions | mirror active when its secret is configured |
| Zenodo community | pumacp | sandbox validated; production pending the first DOI |
| Kaggle dataset | pumacp/puma-community-submissions | prepared but dormant; activated by trigger |

Each mirror has its own GitHub Actions workflow under .github/workflows/, runs on its own schedule, and is gated by the secret it requires.
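The secret gating can be expressed as a simple pre-flight check. The secret names below (`HF_TOKEN`, `ZENODO_TOKEN`, `KAGGLE_KEY`) are hypothetical placeholders, not the names used in the repository's workflows. The sketch relies on the GitHub Actions behavior that a secret which has not been set evaluates to an empty string when mapped into a step's environment.

```python
import os
from typing import Mapping

# Hypothetical secret names; the actual workflow files may use others.
MIRROR_SECRETS = {
    "huggingface": "HF_TOKEN",
    "zenodo": "ZENODO_TOKEN",
    "kaggle": "KAGGLE_KEY",
}


def active_mirrors(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return mirrors whose secret is present and non-empty.

    An unset secret arrives as an empty string in GitHub Actions,
    so empty or whitespace-only values count as "not configured".
    """
    return [name for name, var in MIRROR_SECRETS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    enabled = active_mirrors()
    for mirror in MIRROR_SECRETS:
        status = "run" if mirror in enabled else "skip (secret not configured)"
        print(f"{mirror}: {status}")
```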

The verifier pipeline

Verification is independent of the original submitter:

  • An independent verifier reimplements the PUMA client's hashing algorithm so that it produces byte-identical hashes.
  • The verify-submission workflow detects new submissions via git diff and invokes the verifier.
  • Each submission gets a sidecar <id>.verified.json next to it.
  • Verification status renders as a badge in the leaderboard.

Trust model: cryptographic hashing makes tampering detectable, and the verifier runs independently of the submitter, so the integrity of a published result can be checked without trusting the person who submitted it.
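The verifier steps can be sketched as follows: pick newly added submission files out of `git diff --name-only --diff-filter=A` output, recompute the hash, and write an `<id>.verified.json` sidecar. The sidecar fields, helper names, and canonical-JSON hashing scheme are assumptions for illustration, and the predictions to hash are passed in directly because this page does not say where the verifier reads them from.

```python
import hashlib
import json
from pathlib import Path, PurePosixPath
from typing import Any


def new_submission_paths(diff_output: str) -> list[str]:
    """Select added submissions/<id>.json files, skipping existing sidecars."""
    paths = []
    for line in diff_output.splitlines():
        p = PurePosixPath(line.strip())
        if (
            p.parent == PurePosixPath("submissions")
            and p.suffix == ".json"
            and not p.name.endswith(".verified.json")
        ):
            paths.append(str(p))
    return paths


def canonical_sha256(obj: Any) -> str:
    """Assumed scheme: SHA-256 over sorted-key, compact, UTF-8 JSON."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_sidecar(doc: dict[str, Any], predictions: Any) -> dict[str, Any]:
    """Hypothetical sidecar layout comparing declared and recomputed hashes."""
    declared = doc.get("predictions_summary_hash")
    recomputed = canonical_sha256(predictions)
    return {
        "submission_id": doc.get("submission_id"),
        "declared_hash": declared,
        "recomputed_hash": recomputed,
        "verified": declared == recomputed,
    }


def write_sidecar(submission_path: str, sidecar: dict[str, Any]) -> Path:
    """Write <id>.verified.json next to submissions/<id>.json."""
    path = Path(submission_path)
    out = path.with_name(path.stem + ".verified.json")
    out.write_text(json.dumps(sidecar, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return out
```

Because the verifier only reads the submission and recomputes the digest, it needs no credentials or input from the submitter, which is what keeps it independent.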

How to contribute

# 1. Run PUMA locally and generate a submission
puma run specs/runs/baseline_triage.yaml
puma share-results

# 2. Fork puma-community and create a branch
gh repo fork pumacp/puma-community --clone
cd puma-community && git checkout -b my-submission

# 3. Add the submission JSON
cp ~/.puma/submissions/<id>.json submissions/<id>.json

# 4. Validate locally (optional but recommended)
python -m jsonschema -i submissions/<id>.json schema/submission.v1.json

# 5. Open the PR
git add submissions/<id>.json && git commit -m "Add submission <id>"
git push origin my-submission && gh pr create --fill

See the contributing guide for the long-form walkthrough.

The community

Trust model & Code of Conduct

  • All submissions are released under CC-BY-4.0 with attribution.
  • The project enforces the Contributor Covenant v2.1.
  • Personal-data scanning runs client-side in puma share-results, before a submission payload is ever constructed.
  • CI checks are intentionally narrow (schema, filename, hash) and serve only as defense in depth, so the recommended client path (puma share-results) remains the primary trusted source.
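A client-side personal-data guard of the kind described above might look like the sketch below, which refuses to build a payload when any string field matches a suspicious pattern. The two regular expressions and the function names are illustrative assumptions, not the scanner shipped in `puma share-results`, and real coverage would need to be much broader.

```python
import re
from typing import Any, Iterator

# Illustrative patterns only; a real scanner needs far broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def iter_strings(obj: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (json_path, text) for every string nested anywhere in obj."""
    if isinstance(obj, str):
        yield path, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from iter_strings(value, f"{path}[{i}]")


def find_personal_data(draft: Any) -> list[tuple[str, str]]:
    """Return (json_path, kind) for every suspected personal-data hit."""
    return [
        (p, kind)
        for p, text in iter_strings(draft)
        for kind, rx in PATTERNS.items()
        if rx.search(text)
    ]


def build_payload(draft: dict[str, Any]) -> dict[str, Any]:
    """Scan first; only construct the payload if nothing was flagged."""
    hits = find_personal_data(draft)
    if hits:
        raise ValueError(f"refusing to build submission, personal data found: {hits}")
    return dict(draft)
```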

Roadmap

The hub grows along trigger-based horizons rather than fixed dates:

| Horizon | Milestone | Trigger | Status |
|---|---|---|---|
| H1 | Hub live, CI green, docs published | Public launch | complete |
| H2 | First external community submissions | Outside contributors open PRs | pending external submissions |
| H3 | DOI-backed snapshots | First Zenodo production deposit | planned |
| H4 | Mirror activation (HF / Zenodo / Kaggle) | Secrets configured | designed |
| H5 | Notifications (Discord / Telegram) | Webhook/bot secrets configured | designed |
| H6 | Verifier at scale | Sustained submission volume | designed |


Citation

If you use PUMA Community submissions as a data source, please cite the archive:

@misc{puma_community,
  title        = {PUMA Community: a public archive of community-contributed LLM benchmark results for ICT Project Management},
  author       = {{The PUMA Project}},
  year         = {2026},
  howpublished = {\url{https://github.com/pumacp/puma-community}},
  note         = {Zenodo DOI forthcoming}
}

Note

A Zenodo DOI is forthcoming and will be appended here after the first DOI-backed snapshot.

PUMA Community is released under the MIT License. Built with MkDocs Material. See also the PUMA benchmark tool docs.