Canonical baseline references
This document records the canonical empirical baselines that
puma validate-baseline uses as its reference values. Each entry
fixes a (scenario, model, strategy, N, seed, T) configuration to a
single number (F1-macro for triage, MAE for estimation) measured
on PUMA's validation hardware (RTX 2060 Mobile 6 GB, gpu-entry
profile). Subsequent releases must reproduce these values within the
documented tolerance; otherwise, puma validate-baseline exits
non-zero.
When a reference baseline is established or refreshed, the row in the
table below is updated, the run_id is recorded for traceability,
and the release notes for that version document the change explicitly.
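The pass/fail rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not PUMA's implementation: the function names `within_tolerance` and `exit_code` are hypothetical, the inclusive comparison at the tolerance boundary is an assumption, and the exit status `1` stands in for the unspecified "non-zero" value. The reference values and tolerances are the ones recorded in this document; the measured F1 of 0.5901 is a made-up input.

```python
def within_tolerance(measured: float, expected: float, tolerance: float) -> bool:
    """Return True when the measured metric lies within +/- tolerance of the reference."""
    # Assumption: the boundary is inclusive; the real CLI may compare differently.
    return abs(measured - expected) <= tolerance


def exit_code(measured: float, expected: float, tolerance: float) -> int:
    """Map the check onto a process exit status: 0 for PASS, 1 (assumed) for FAIL."""
    return 0 if within_tolerance(measured, expected, tolerance) else 1


# triage_jira reference: F1-macro 0.5867, tolerance 0.01.
# 0.5901 is a hypothetical measured value (delta = +0.0034).
print(exit_code(0.5901, 0.5867, 0.01))  # prints 0

# estimation_tawos reference: MAE 5.7150 SP, tolerance 0.05 SP.
# 6.3150 is the contaminated-state value reported in this document (delta = +0.6 SP).
print(exit_code(6.3150, 5.7150, 0.05))  # prints 1
```

Values that sit exactly on the tolerance boundary are subject to floating-point rounding, so a sketch like this should not be used to decide borderline cases.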
triage_jira
| Field | Value |
|---|---|
| Spec | specs/runs/baseline_triage.yaml |
| Scenario | triage_jira |
| Model | qwen2.5:3b |
| Strategy | contextual-anchoring |
| Sample size | 200 |
| Seed | 42 |
| Temperature | 0.0 |
| Reference F1-macro | 0.5867 |
| Tolerance | ±0.01 |
| Hardware | gpu-entry (RTX 2060 Mobile 6 GB) |
| Established | v2.0.0 |
Invocation:
puma validate-baseline
# or, equivalently:
puma validate-baseline --expected-f1 0.5867 --tolerance 0.01
estimation_tawos
| Field | Value |
|---|---|
| Spec | specs/runs/baseline_estimation_canonical.yaml |
| Scenario | estimation_tawos |
| Model | qwen2.5:3b |
| Strategy | zero-shot |
| Sample size | 200 |
| Seed | 42 |
| Temperature | 0.0 |
| Reference MAE | 5.7150 SP |
| Tolerance | ±0.05 SP |
| Hardware | gpu-entry (RTX 2060 Mobile 6 GB) |
| Established | v2.5.0 |
| Establishing run | baseline_estimation_canonical_v1__26d0e07aaa7949ec__20260516T003317 |
| Duration | 55.2 s |
| Parse failure rate | 0.0000 |
Auxiliary metrics from the establishing run: mdae=4.0000,
rmse=8.0078, n_samples=200.
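For readers unfamiliar with the three error metrics, the sketch below shows how MAE, MdAE and RMSE are conventionally computed from paired true and predicted story points. It assumes the standard textbook definitions; the function name `error_metrics` and the toy inputs are hypothetical and are not taken from PUMA's `puma.metrics` module.

```python
import math
import statistics


def error_metrics(y_true, y_pred):
    """Compute MAE, MdAE and RMSE for paired true/predicted story points."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    squared_errors = [e * e for e in abs_errors]
    return {
        "mae": statistics.fmean(abs_errors),        # mean absolute error
        "mdae": statistics.median(abs_errors),      # median absolute error
        "rmse": math.sqrt(statistics.fmean(squared_errors)),
        "n_samples": len(abs_errors),
    }


# Toy data (not from TAWOS): one large miss dominates MAE and RMSE,
# while the median absolute error stays small.
metrics = error_metrics([1, 2, 3, 5, 8], [1, 3, 3, 8, 21])
print(metrics["mae"], metrics["mdae"], round(metrics["rmse"], 4))
```

An ordering of mdae < mae < rmse, as in the establishing run (4.0000 < 5.7150 < 8.0078), is what one would expect when a minority of issues carry large errors.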
Invocation:
# Recommended: restart Ollama first to guarantee a fresh model state.
docker compose restart puma_ollama
puma validate-baseline \
  --spec specs/runs/baseline_estimation_canonical.yaml \
  --expected-mae 5.7150 --tolerance 0.05
Protocol note: fresh Ollama state
Empirically (v2.5.0, four-run verification), MAE = 5.7150 is bit-exact
across cold-start and warm-state runs of the canonical estimation
baseline when the only recent activity is estimation_tawos itself.
A triage_jira run executed between an Ollama restart and the
estimation validation perturbs the model's KV-cache state and shifts
MAE to ≈ 6.3150 (delta = +0.6 SP), a regression well outside the
±0.05 SP tolerance.
Validation runs should therefore be performed from a fresh Ollama
state. If both triage and estimation baselines need to be validated
in one session, restart puma_ollama between them, e.g.:
docker compose restart puma_ollama
puma validate-baseline # F1 path, default triage spec
docker compose restart puma_ollama
puma validate-baseline \
  --spec specs/runs/baseline_estimation_canonical.yaml \
  --expected-mae 5.7150 --tolerance 0.05
The cross-scenario state contamination is a property of Ollama's
inference engine and is independent of PUMA's code path. See
docs/known_debt.md D3 for the broader CUDA non-determinism context.
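The restart-between-validations protocol can also be scripted so that a restart is never forgotten. The following is a minimal sketch under stated assumptions: the helper names `validation_plan` and `run_plan` are hypothetical, the commands and flags are copied from the invocations in this document, and the injectable `runner` parameter exists only so the logic can be exercised without Docker or PUMA installed.

```python
import subprocess

# Commands copied verbatim from the invocations documented above.
RESTART = ["docker", "compose", "restart", "puma_ollama"]
TRIAGE = ["puma", "validate-baseline"]
ESTIMATION = [
    "puma", "validate-baseline",
    "--spec", "specs/runs/baseline_estimation_canonical.yaml",
    "--expected-mae", "5.7150",
    "--tolerance", "0.05",
]


def validation_plan(checks):
    """Place an Ollama restart in front of every validation command."""
    plan = []
    for check in checks:
        plan.append(RESTART)
        plan.append(check)
    return plan


def run_plan(plan, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit status."""
    for argv in plan:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0


# Usage (requires Docker and PUMA on the PATH):
#   raise SystemExit(run_plan(validation_plan([TRIAGE, ESTIMATION])))
```

Stopping at the first failure is a design choice of this sketch; a CI job might prefer to run both validations and report all failures together.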
Why this differs from Phase B numbers
The Phase B analysis (docs/results/phase_b_analysis.md) reports an
estimation MAE of 2.91 for qwen2.5:3b on a smaller sample drawn from
the TAWOS dataset. The v2.5.0 reference (MAE = 5.7150 SP) uses
N=200 and the canonical seed/temperature/strategy combination
fixed in baseline_estimation_canonical.yaml. The two numbers are
not directly comparable: TAWOS story-point distributions are
long-tailed, so MAE is sensitive to which specific instances are
sampled.
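The sampling sensitivity can be illustrated with a toy calculation. The story points below are synthetic and chosen only to mimic a long tail; they are not drawn from TAWOS, and the constant predictor is a stand-in for a model, not the behaviour of qwen2.5:3b.

```python
def mae(y_true, y_pred):
    """Mean absolute error over paired values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic long-tailed story points: mostly small values plus one outlier.
population = [1, 1, 2, 2, 3, 3, 5, 5, 8, 40]
constant_prediction = 3  # hypothetical predictor that always answers 3 SP

sample_a = population[:5]  # misses the tail: [1, 1, 2, 2, 3]
sample_b = population[5:]  # includes the tail: [3, 5, 5, 8, 40]

mae_a = mae(sample_a, [constant_prediction] * len(sample_a))
mae_b = mae(sample_b, [constant_prediction] * len(sample_b))
print(mae_a, mae_b)  # prints 1.2 9.2
```

The same predictor scores very differently depending on whether the sample happens to contain the outlier, which is why MAE values measured on different samples should not be compared directly.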
The v2.5.0 reference is the contract going forward. The Phase B
number remains historically valid as a cross-model comparison
within that sweep but is not used by puma validate-baseline.
Tolerance philosophy
Under T=0.0 + seed=42 + a fixed model digest, both F1 and MAE
should be bit-exact reproducible in warm state on the same hardware.
The published tolerances (±0.01 for F1, ±0.05 for MAE) absorb the
known cold-vs-warm drift (≤ 0.006 for F1 per the D3 analysis) plus a
safety margin for floating-point ordering differences across
equivalent NVIDIA hardware (see docs/HARDWARE.md gpu-entry
tolerance table).
If puma validate-baseline returns FAIL with delta outside the
documented tolerance, the cause is typically one of:
- A model version mismatch (run ollama show qwen2.5:3b and compare the digest).
- A non-determinism regression in the code path (puma.scenarios, puma.metrics, puma.runtime).
- A hardware tier mismatch (e.g., running on CPU when the reference was established on gpu-entry).
See docs/known_debt.md for the live status of any open
reproducibility-related debt.
Refreshing a reference
To refresh a reference baseline in a future release:
- Run the canonical spec on the validation hardware in warm state.
- Verify parse_failure_rate = 0 and the run completes without errors.
- Record the new run_id and value in this document.
- Update the relevant default in src/puma/cli.py only if the change reflects an intentional methodology shift (e.g., changing the canonical strategy from zero-shot to contextual-anchoring for estimation). Otherwise, leave the CLI default as the user-facing reference and document the new value here.
- Note the refresh in CHANGELOG.md and the release notes.