First official PUMA Community submission
On 2026-05-31 the PUMA project recorded its inaugural production submission to the PUMA Community leaderboard. This page documents that landmark end to end: the exact identifiers, the path the submission actually took, how to reproduce it, and the known infrastructure gaps that were deferred to a later sprint.
What landed
The inaugural submission benchmarks a small local model on the triage task and establishes the first reproducible data point on the public leaderboard.
| Field | Value |
|---|---|
| Model | qwen2.5:3b |
| Scenario | triage_jira |
| Strategy | zero_shot |
| Sample size | 200 |
| Seed | 42 |
| Temperature | 0.0 |
| F1-macro | 0.3898 |
| Accuracy | 0.465 |
| Latency p50 / p95 | 199 ms / 208 ms |
| Hardware profile | gpu-entry (12-core x86_64, 31 GB RAM, NVIDIA RTX 2060 6 GB, Linux) |
| Energy (CodeCarbon) | 0.116 g CO2, 0.000668 kWh, country ESP |
| PUMA version | 3.1.0 |
| Submitter alias | pumacp |
| Submitted at | 2026-05-31T09:16:01.064478Z |
Submission identifiers
run_id: baseline_triage_zero_shot_s12_n1__83ec5feaa8df4844__20260531T091417
submission_id: 1d88e49b-5b49-46b9-a8a6-df7bdd5bf80b
predictions_summary_hash: f60423ca6a6e9b033f0f89ac5a5a127d889a6e2627fc07c480c44bfdf53857ec
merge_sha: 111cee36af2b9657ee702846727951f32af44462
merged_at: 2026-05-31T12:43:48Z
The predictions_summary_hash is a 64-character hex digest computed
deterministically from the per-item predictions. Anyone can recompute it from
the published artefact and confirm it matches, which is what makes the result
verifiable independent of any workflow status.
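A minimal sketch of such a recomputation follows. It assumes the digest is SHA-256 (consistent with the 64-character hex length) taken over a canonical JSON serialisation of each prediction record in file order. The canonicalisation rules PUMA actually applies are not stated on this page, so `canonical_line`, `summary_hash`, and `load_jsonl` are hypothetical helpers that would need to be aligned with the framework before their output could be compared against the published value.

```python
import hashlib
import json
from typing import Iterable


def canonical_line(record: dict) -> bytes:
    """Serialise one prediction record deterministically (assumed scheme)."""
    # Sorted keys and fixed separators remove key-order and whitespace variation.
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def summary_hash(records: Iterable[dict]) -> str:
    """Return a SHA-256 hex digest over the records, in the order given."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(canonical_line(record))
        digest.update(b"\n")  # separator so adjacent records cannot run together
    return digest.hexdigest()


def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file into a list of records, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

With a local copy of the per-item predictions, `summary_hash(load_jsonl(path))` would give the value to compare. The digest is order-sensitive by design, so the records must be read in their published order.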
The maintainer-driven submission path
This first submission did not flow through the automated puma share-results
PR-opening step. During the share phase the CLI generated the in-memory review
panel and then hung, never opening a pull request. To unblock the landmark, a
maintainer took the canonical JSON and JSONL artefacts emitted by
puma share-results --dry-run — the same artefacts the automated path would
have pushed — and submitted them by hand: fork, push, and an admin-bypass merge.
We document this honestly as the maintainer-driven inaugural submission path.
The data is canonical; only the delivery mechanism was manual. Future
submissions are expected to use puma share-results once the CLI hang is
investigated and fixed (tracked as deferred debt below).
Propagation chain
Once the submission pull request merged, the downstream propagation chain ran:
- Submission PR merged into puma-community/main at merge SHA 111cee36.
- Hugging Face dataset mirror: the submission JSON was mirrored to the community submissions dataset (file submissions/1d88e49b-5b49-46b9-a8a6-df7bdd5bf80b.json).
- update-badges workflow: succeeded, refreshing the repository badges.
- Leaderboard Space: the new row became visible on the public leaderboard Space after its cache window elapsed (LRU TTL roughly 5-15 min).
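The mirror step in the chain above can be spot-checked by fetching the mirrored file directly. The sketch below builds the download URL using the Hugging Face Hub `resolve` URL pattern. It assumes the file lives on the dataset's `main` revision and that the dataset is publicly readable, neither of which this page states. `mirrored_file_url` and `fetch_submission` are illustrative names, not part of PUMA.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

DATASET_REPO = "pumaproject/puma-community-submissions"
SUBMISSION_ID = "1d88e49b-5b49-46b9-a8a6-df7bdd5bf80b"


def mirrored_file_url(repo_id: str, submission_id: str, revision: str = "main") -> str:
    """Build the Hub download URL for a mirrored submission JSON."""
    # The "main" default is an assumption; pass another revision if needed.
    path = f"submissions/{submission_id}.json"
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/resolve/{quote(revision, safe='')}/{quote(path)}"
    )


def fetch_submission(url: str, timeout: float = 10.0) -> dict:
    """Download and parse the submission JSON (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_submission(mirrored_file_url(DATASET_REPO, SUBMISSION_ID))` would return the mirrored record if both assumptions hold.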
F1 as the zero_shot floor anchor
The reported F1-macro of 0.3898 is the floor anchor for the zero_shot
strategy on triage_jira. It is the first reproducible zero-shot data point and
gives later strategies a fixed baseline to improve against.
This figure is not a regression against the higher contextual-anchoring baseline. Those are different strategies with different prompting depths: zero-shot supplies no in-context examples, so a lower score is expected and correct. Comparing the two head to head would be a category error. The value of this anchor is precisely that it pins the simplest strategy at a known, recomputable level.
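For readers unfamiliar with the metric, the sketch below shows how macro-averaged F1 is computed and why it can sit below accuracy, as it does here (0.3898 against 0.465). It averages the per-class F1 over the union of true and predicted labels, which is a common convention. How PUMA defines the label set is not stated on this page, and the labels in the usage note are toy values, not the triage_jira classes.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); a class with no hits scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With true labels `["bug", "bug", "bug", "task", "story"]` and every item predicted as `"bug"`, accuracy is 0.6 while macro-F1 is 0.25: the two classes that are never predicted each contribute an F1 of zero to the average.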
Reproducibility
The submission is fully reproducible. The decisive parameters are fixed: seed 42,
temperature 0.0, and sample size 200, on PUMA version 3.1.0. Re-running the same
scenario, strategy, and model against the same seed reproduces the predictions,
and recomputing the digest yields the same predictions_summary_hash
(f60423ca...53857ec). Because the automated integrity verifier workflow failed
on this run, the submission JSON carries a verification_status of
self-attested rather than verified. The result can still be checked
independently by recomputing the hash from the published predictions.
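A re-run can be checked against the pinned values before any hashes are compared. The sketch below is one way to do that. The dictionary keys (`sample_size`, `puma_version`, and so on) are hypothetical field names chosen for illustration and may not match the actual submission JSON schema. The values themselves are taken from this page.

```python
# Pinned parameters of the inaugural submission (values from this page;
# key names are assumed, not taken from the PUMA schema).
EXPECTED = {
    "model": "qwen2.5:3b",
    "scenario": "triage_jira",
    "strategy": "zero_shot",
    "sample_size": 200,
    "seed": 42,
    "temperature": 0.0,
    "puma_version": "3.1.0",
}

PUBLISHED_HASH = "f60423ca6a6e9b033f0f89ac5a5a127d889a6e2627fc07c480c44bfdf53857ec"


def parameter_mismatches(record: dict, expected: dict = EXPECTED) -> dict:
    """Map each differing key to a pair (expected value, value found)."""
    return {
        key: (value, record.get(key))
        for key, value in expected.items()
        if record.get(key) != value
    }


def hash_matches(recomputed: str, published: str = PUBLISHED_HASH) -> bool:
    """Compare hex digests, ignoring case and surrounding whitespace."""
    return recomputed.strip().lower() == published.strip().lower()
```

An empty result from `parameter_mismatches` means the re-run used the same pinned configuration. Only then is a hash comparison meaningful.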
Known workflow gaps (deferred to S12.19)
Three submission-side workflows did not run to green on this inaugural landing. None affects the validity of the data; all are infrastructure issues deferred to S12.19:
- validate-submission: failed at the "Prepare all required actions" step on a broken action reference (actions-ecosystem/action-add-labels@v1.4.0, which was never published); fix by pinning to @v1.3.0 or a current commit SHA. Tracked as D38.
- Verify submission integrity: failed with TypeError: Client.__init__() got an unexpected keyword argument 'hf_token', a gradio_client API drift (hf_token= should now be token=); this is why the inaugural submission carries verification_status="self-attested" rather than "verified". Tracked as D39.
- notify-discord: failed for a missing webhook secret; deferred to the post-Sprint-12 backlog (not a code defect).
The puma share-results CLI hang that forced the manual path is tracked as
D40.
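For the keyword drift behind D39, one possible mitigation is a small shim that inspects the client constructor and passes the token under whichever keyword it accepts. This is a sketch under assumptions, not the fix adopted by the project: `build_client` is a hypothetical helper, and it takes the client class as an argument so it can be exercised without gradio_client installed.

```python
import inspect
from typing import Any, Callable, Optional


def build_client(client_cls: Callable[..., Any], space: str, token: Optional[str]) -> Any:
    """Construct a client, passing the token under the keyword the class accepts."""
    params = inspect.signature(client_cls).parameters
    if "token" in params:
        # Newer keyword, per the drift described above.
        return client_cls(space, token=token)
    if "hf_token" in params:
        # Older keyword, kept for installations that predate the rename.
        return client_cls(space, hf_token=token)
    # Neither keyword is declared explicitly: construct without a token.
    return client_cls(space)
```

In the verifier workflow this would be called with `gradio_client.Client` as `client_cls`. Pinning the gradio_client version in the workflow is a simpler alternative if supporting both keywords is not needed.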
Cross-repository link map
| Resource | Location |
|---|---|
| Submission pull request | https://github.com/pumacp/puma-community/pull/8 |
| Benchmark framework repository | https://github.com/pumacp/puma |
| Hugging Face submissions dataset | https://huggingface.co/datasets/pumaproject/puma-community-submissions |
| Hugging Face leaderboard Space | https://huggingface.co/spaces/pumaproject/puma-leaderboard |
| Mirrored artefact | submissions/1d88e49b-5b49-46b9-a8a6-df7bdd5bf80b.json |
See also the publication workflow and the technical reference for the full submission lifecycle.