Cross-architecture reproducibility
Open question
Are PUMA's task-level metrics (F1-macro, MAE) and per-token logprobs
bit-exact between x86_64 Linux (validation environment) and arm64
macOS (Apple Silicon, native mode) under PUMA's canonical configuration
(T=0.0, seed=42, same Ollama model digest)?
Current status
Empirically unverified as of v2.6.0. The detection and dispatch infrastructure to enable a future empirical test is in place:
- `puma.preflight.apple_silicon.detect_apple_silicon_profile` maps the 10 catalogued M-series chip variants to `apple-silicon-*` profile identifiers.
- `config/profiles.yaml` declares all 9 apple-silicon-m3/m4/m5 profiles with `empirical_validation: pending`.
- `start_puma.sh --native` boots a native Ollama (Metal accelerated) and a Python venv on macOS, enabling Mode B end-to-end.
- `puma validate-baseline --expected-f1` and `--expected-mae` (from v2.5.0) can run unchanged on Mode B and produce comparable numbers to the Linux reference.
No empirical comparison has been performed — the project's validation
hardware (RTX 2060 Mobile 6 GB, gpu-entry) does not extend to Apple
Silicon.
Theoretical expectations
| Metric | Expected behaviour | Confidence |
|---|---|---|
| `f1_macro` | Bit-exact (±0.000) across architectures | High |
| `mae` | Bit-exact (±0.0001) across architectures | High |
| Per-token logprobs | Differ by FP rounding (~1e-5 to 1e-3) | Medium |
| Per-prediction confidence (derived) | Differ marginally | Medium |
| `latency_p95` | Differ substantially (GPU class) | Certain |
| `co2_kg`, `kwh` | Differ substantially (hardware + tracker) | Certain |
Why f1 and mae are expected bit-exact
Q4_K_M quantisation (Ollama's default) performs the bulk of its matmul
work on quantised integer blocks, with floating-point scales applied
per block. Greedy decoding (`temperature=0.0`) emits the argmax token at
each step, and an argmax changes only when rounding differences reorder
the top two logits, which requires a near-tie. The token-id sequence is
therefore expected to be deterministic at the level of token equality,
regardless of the underlying SIMD instruction set (AVX2 on x86_64, NEON
on arm64). Both `f1_macro` and `mae` are computed from the parsed token
output, so they should agree bit-exactly when the same token sequence
is produced.
The triage parser maps the first valid label token to one of 5 classes; the estimation parser maps the first valid integer literal to a story-point value. Neither parser depends on continuous floating-point comparisons — only on token equality.
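The parsing rule described above can be sketched as follows. This is a minimal illustration, not PUMA's actual parser: the five label names in `LABELS` are hypothetical placeholders (the document states only that there are 5 classes), and the function names are invented for the example.

```python
import re
from typing import Optional

# Hypothetical label set: the document says there are 5 classes but
# does not name them, so these strings are placeholders.
LABELS = ("blocker", "critical", "major", "minor", "trivial")


def parse_triage(output: str) -> Optional[str]:
    """Return the first alphabetic token that equals a known label."""
    for token in re.findall(r"[a-z]+", output.lower()):
        if token in LABELS:
            return token
    return None  # would count towards parse_failure_rate


def parse_estimation(output: str) -> Optional[int]:
    """Return the first standalone integer literal in the output.

    Digits that are part of a decimal (3.5) or an identifier (v2)
    are skipped, so only whole-number literals are accepted.
    """
    match = re.search(r"(?<![\w.])\d+(?!\.\d|\w)", output)
    return int(match.group()) if match else None


print(parse_triage("Priority: Major (not minor)"))   # major
print(parse_estimation("Roughly 3.5, so call it 8."))  # 8
```

Both functions compare strings and integers only, which is why identical token sequences on two architectures would yield identical parsed predictions.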
Why logprobs are expected to differ
Logprobs are continuous floating-point values derived from the
softmax of the model's final-layer logits. Softmax includes
exponentiation and normalisation, both of which involve
floating-point arithmetic in f32 or f16 (depending on the
quantised path). Differences in instruction-level FP semantics
(AVX2 FMA vs NEON FMA, vendor-specific fast-math flags, kernel
choices in Metal Performance Shaders vs CUDA) can produce
non-identical bit patterns even when the model weights and inputs
are byte-identical.
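The effect can be illustrated in a few lines of plain Python. This sketch uses 64-bit Python floats rather than the f32/f16 arithmetic of a real inference kernel, and the logit values are made up for illustration. It shows that floating-point addition is not associative, and that changing only the accumulation order inside a log-softmax leaves the argmax untouched while the logprob values may shift in their last bits.

```python
import math


def logsumexp(values, reverse=False):
    """Log-sum-exp with the terms accumulated in a chosen order."""
    peak = max(values)
    total = 0.0
    for v in sorted(values, reverse=reverse):
        total += math.exp(v - peak)
    return peak + math.log(total)


def log_softmax(logits, reverse=False):
    norm = logsumexp(logits, reverse=reverse)
    return [x - norm for x in logits]


# Floating-point addition is not associative.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

# Made-up logits; only the summation order differs between the two runs.
logits = [2.31, -0.47, 5.02, 4.98, 0.13, -3.2, 1.77]
lp_fwd = log_softmax(logits, reverse=False)
lp_rev = log_softmax(logits, reverse=True)

# The gap may be exactly zero or a few units in the last place.
max_gap = max(abs(x - y) for x, y in zip(lp_fwd, lp_rev))
argmax_fwd = max(range(len(logits)), key=lambda i: lp_fwd[i])
argmax_rev = max(range(len(logits)), key=lambda i: lp_rev[i])
print(max_gap, argmax_fwd == argmax_rev)
```

In lower precision the same mechanism produces proportionally larger gaps, which is consistent with the 1e-5 to 1e-3 range expected above, though the actual magnitude remains to be measured.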
Implication: ECE (Expected Calibration Error) — which is computed from logprobs — may differ slightly between Linux and macOS. The magnitude of the difference is the open empirical question.
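For reference, a common equal-width-bin formulation of ECE is sketched below. It is an assumption that PUMA uses this exact binning and that confidence is derived as `exp(logprob)`; the function names are illustrative only.

```python
import math
from typing import Sequence


def confidence_from_logprob(logprob: float) -> float:
    """Assumed derivation: confidence is the probability exp(logprob)."""
    return math.exp(logprob)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        return 0.0
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Clamp so that c == 1.0 falls into the last bin.
        k = min(max(int(c * n_bins), 0), n_bins - 1)
        conf_sum[k] += c
        hits[k] += 1 if ok else 0
        count[k] += 1
    ece = 0.0
    for k in range(n_bins):
        if count[k]:
            accuracy = hits[k] / count[k]
            mean_conf = conf_sum[k] / count[k]
            ece += (count[k] / n) * abs(accuracy - mean_conf)
    return ece
```

Because each confidence enters both the bin assignment and the bin mean, a logprob shift of 1e-3 normally moves ECE by a similarly small amount, unless it pushes a prediction across a bin boundary.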
Hypothesis for future empirical work
- H0: `f1_macro` and `mae` are bit-exact across architectures under `T=0.0` + `seed=42` + same model digest, on the canonical baselines.
- H1: At least one of `f1_macro` or `mae` differs by more than ±0.000 across architectures under the same configuration.
A secondary hypothesis on logprobs:
- H2: The mean absolute difference between paired logprobs (same instance, same prompt) across architectures is below 1e-3.
- H3: ECE differs by less than ±0.01 across architectures on the canonical triage baseline.
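The four hypotheses translate into simple checks once both runs exist. The sketch below is illustrative: `RunMetrics` and `evaluate_hypotheses` are hypothetical names, bit-exactness is interpreted as float equality on the reported values, and the ECE and logprob numbers in the usage example are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RunMetrics:
    f1_macro: float
    mae: float
    ece: float


def evaluate_hypotheses(
    linux: RunMetrics,
    mac: RunMetrics,
    mean_abs_logprob_diff: float,
) -> Dict[str, bool]:
    """Map each hypothesis to whether the paired runs support it."""
    # Assumption: "bit-exact" means the reported values compare equal.
    h0 = linux.f1_macro == mac.f1_macro and linux.mae == mac.mae
    return {
        "H0": h0,
        "H1": not h0,
        "H2": mean_abs_logprob_diff < 1e-3,
        "H3": abs(linux.ece - mac.ece) < 0.01,
    }


# F1 and MAE are the canonical Linux references; the ECE values and
# the logprob difference are PLACEHOLDERS, not measured results.
linux_run = RunMetrics(f1_macro=0.5867, mae=5.7150, ece=0.120)
mac_run = RunMetrics(f1_macro=0.5867, mae=5.7150, ece=0.125)
print(evaluate_hypotheses(linux_run, mac_run, mean_abs_logprob_diff=4e-4))
```

H0 and H1 are complementary by construction, so exactly one of them holds for any pair of runs.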
Testing protocol when Mac hardware becomes available
- Acquire / borrow a MacBook with one of the M-series variants catalogued in `config/profiles.yaml`. Note the chip brand exactly as `sysctl -n machdep.cpu.brand_string` reports it.
- `brew install ollama` and verify the daemon starts.
- Pull `qwen2.5:3b` natively (`ollama pull qwen2.5:3b`). Compare the digest against the Linux reference (`ollama show qwen2.5:3b`). If the digests differ, the comparison is invalidated, because Ollama's distribution layer may serve different quantisations across architectures.
- Clone PUMA, `./start_puma.sh --native`, and run `puma validate-baseline` for both canonical baselines, passing `--expected-f1` (triage) and `--expected-mae` (estimation).
- Record:
  - `f1_macro` and `mae` (PASS/FAIL with delta).
  - `parse_failure_rate` (should be 0.0; non-zero is a parser-level cross-arch issue, not a deterministic-FP issue).
  - `puma list-runs --last-n 1 --json` to capture the exact run_id and timestamps.
- Per-token logprob comparison (advanced):
  - Re-run both baselines with `inference.logprobs: true` in the spec.
  - Export `predictions.logprobs_json` from both DBs.
  - Compute mean absolute difference per token across paired instances.
  - Compute ECE on both runs and report the delta.
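The paired logprob comparison in the advanced step could look like the sketch below. The export format is an assumption: each side is taken to be a JSON object mapping an instance id to a list of per-token logprobs, which may not match the real layout of `predictions.logprobs_json`. Instances whose token counts differ are reported separately, because a length mismatch means the token sequences themselves diverged.

```python
import json
from typing import List, Tuple


def mean_abs_logprob_diff(
    linux_json: str, mac_json: str
) -> Tuple[float, int, List[str]]:
    """Compare per-token logprobs for instances present in both exports.

    Returns (mean absolute difference, tokens compared, mismatched ids).
    Assumed input shape: {"<instance_id>": [logprob, logprob, ...], ...}
    """
    linux = json.loads(linux_json)
    mac = json.loads(mac_json)
    total = 0.0
    n_tokens = 0
    mismatched: List[str] = []
    for instance_id in sorted(linux.keys() & mac.keys()):
        a, b = linux[instance_id], mac[instance_id]
        if len(a) != len(b):
            # Different token counts: not an FP-rounding difference.
            mismatched.append(instance_id)
            continue
        total += sum(abs(x - y) for x, y in zip(a, b))
        n_tokens += len(a)
    mean = total / n_tokens if n_tokens else float("nan")
    return mean, n_tokens, mismatched


# Toy exports with made-up values, for illustration only.
linux_export = json.dumps({"a": [-0.5, -1.0], "b": [-0.25]})
mac_export = json.dumps({"a": [-0.5, -1.5], "b": [-0.25, -0.125]})
print(mean_abs_logprob_diff(linux_export, mac_export))  # (0.25, 2, ['b'])
```

A non-empty mismatch list is itself evidence against H0 and is worth including in the report alongside the mean difference.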
Results from this protocol close H0/H1 and partially close H2/H3.
File an issue or PR documenting the findings; if H0 holds, update
this document's status section to "empirically verified" and add
the Mac hardware row to docs/HARDWARE.md's tolerance table.
Implication for v2.6.0 users
Users running PUMA in macOS Mode B (native Ollama, Metal accelerated) should NOT assume their F1/MAE results are directly comparable to Linux baselines without first running the protocol above on their specific hardware. The canonical Linux references remain authoritative:
- triage_jira F1 = 0.5867 (qwen2.5:3b + contextual-anchoring, N=200)
- estimation_tawos MAE = 5.7150 SP (qwen2.5:3b + zero-shot, N=200)
A Mode B user who observes f1 = 0.5867 is reproducing the result
exactly; a user who observes f1 = 0.5712 should suspect either a
model-digest mismatch (verify with ollama show) or a genuine
cross-arch FP-ordering divergence (H1 holds for their specific
configuration). Either outcome is informative and would benefit the
project — please report it.
Related documents
- `MACOS_NOTES.md`: Mode A vs Mode B operational notes.
- `HARDWARE.md`: gpu-entry tolerance table; Apple Silicon row currently points back to this document.
- `baseline_references.md`: canonical empirical baselines for both scenarios.
- `CATALOG_HISTORY.md`: `catalog_version` 2.6.0 entry documenting the apple-silicon-* additions.