Open Questions¶

Design decisions taken during implementation that were not explicitly specified, documented here for review.

Resolved¶

Q1 — Response parsing fallback policy¶

Question: When a model does not follow the output format, should the prediction be (a) excluded (None), (b) assigned an "unknown" class, or (c) retried?

Decision: Return None and exclude from metric computation. The parse_failure_rate metric tracks this separately.

Rationale: Assigning "unknown" would pollute F1 and accuracy with a synthetic class. Retries were not implemented to keep inference time predictable. The parse_failure_rate metric makes parse failures visible without distorting task metrics.

Q2 — pytest.ini vs pyproject.toml¶

Question: Both pytest.ini and [tool.pytest.ini_options] in pyproject.toml exist. Which takes precedence?

Decision: pytest.ini is kept as the canonical config because the Docker Python 3.11 environment resolves it preferentially over pyproject.toml.

Resolution: Both files are kept in sync manually. pytest.ini is the authoritative source.

Q3 — Prompt language¶

Question: Should prompts be in English or Spanish?

Decision: All prompts are in English. The Jinja2 templates in specs/prompts/ use English. Legacy src/evaluate_*.py files (pre-v2) used Spanish prompts but are excluded from the active pipeline.

Q4 — TAWOS SQL parsing¶

Question: The db/TAWOS.sql file is 4.3 GB with 1 004 INSERT batches. Runtime parsing is impractical.

Decision: Use data/tawos_clean.csv (9 020 rows, pre-processed) as the canonical runtime artifact. The SQL file is kept as a source-of-truth reference. A one-time conversion script can regenerate the CSV.

Q5 — OllamaClient sync vs async¶

Question: The Runner orchestration loop uses generate_sync(). Should it use async generate()?

Decision: Sync is used to avoid event-loop complexity in the sequential orchestration loop. The async variant is implemented and available for future parallel batch execution.

Q6 — Inference cache invalidation¶

Question: When a model is updated (new weights pulled for the same tag), cached responses may be stale.

Decision: Cache keys include (model_tag, prompt_hash, temperature, seed) but not a model version hash (Ollama does not expose one in the API). Users must run puma cache clear after pulling a model update.

Status: Acceptable for v2.0.0. A future improvement would query the model digest from ollama show <model> and include it in the cache key.

Open (unresolved)¶

Q7 — Optimal sample size per scenario on `cpu-standard`¶

Question: What sample size gives statistically reliable F1 estimates within the 30-minute gate time on cpu-standard?

Observation: With qwen2.5:3b and zero-shot, 50 instances takes ~8–12 minutes on cpu-standard. A 200-instance run would take ~30–45 minutes — above the gate limit for two models.

Proposed answer: 50 instances per model per strategy for the gate run; 200 instances for publication-quality results.

Status: Needs empirical validation on a cpu-standard machine.

Q8 — Model warm-up¶

Question: Should the first inference call be discarded (warm-up round)?

Observation: The first call to a freshly loaded model includes the model load time (load_duration_ns in the Ollama response). Subsequent calls are faster.

Current behaviour: All calls are included in latency metrics. The parse_ollama_timings() function exposes load_ms separately so downstream analysis can distinguish cold vs warm calls.

Proposed answer: Expose is_warm flag in the prediction row and compute separate latency distributions. Not yet implemented.

Q9 — Logprob support detection in preflight¶

Question: Ollama ≥ 0.12.11 is required for logprob extraction. Should puma preflight block runs with logprobs: true on older versions?

Current behaviour: If logprobs are requested but not supported, Ollama returns an empty logprobs field. The calibration metrics are then skipped (no data).

Proposed improvement: Add a preflight check that compares ollama_version against 0.12.11 and emits an ERROR severity issue when logprobs: true is requested on an incompatible version.

Status: Not yet implemented in puma.preflight.provisioning.