Testing: structure, coverage, and execution
This document describes PUMA's test organisation and records the coverage breakdown by module group as of v2.5.0. It exists so that contributors can see where tests are dense and where gaps are acceptable, rather than guessing from a single global number.
Test layout
| Directory | Purpose | Typical fixtures |
|---|---|---|
| `tests/unit/` | Pure-Python unit tests; no Ollama, no Docker volumes. | Module-local YAML, in-memory SQLite |
| `tests/integration/` | Cross-module integration. Sub-marked `@pytest.mark.ollama` when a live Ollama server is required. | `puma.db` fixture, real Ollama for marked tests |
| `tests/cli/` | One file per `puma <command>`; covers `--help` exposure, happy path, error paths. | Typer `CliRunner`, monkeypatch of `_run_baseline_for_validation` and similar helpers |
| `tests/smoke/` | End-to-end smoke runs gated by long-running marks. | Real Ollama, small N, abbreviated specs |
Markers
| Marker | Meaning | Default behaviour |
|---|---|---|
| `@pytest.mark.unit` | Pure-Python; no external dependencies. | Always selected |
| `@pytest.mark.integration` | Cross-module; may touch the filesystem. | Always selected |
| `@pytest.mark.ollama` | Requires a live Ollama server with at least one pulled model. | Deselected by default (`-m "not ollama"` in CI's `lint-and-test` job). Run by the separate `integration-tests-ollama` CI job on push to `main`/`develop` (see `.github/workflows/lint-and-test.yml`, added in v2.5.0; I8). |
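How these markers are registered is not shown in this document. The following is a minimal sketch, assuming they are declared in a `conftest.py` through pytest's `pytest_configure` hook; the `MARKERS` dictionary and its descriptions are illustrative and simply restate the table above.

```python
# conftest.py (assumed location; PUMA's actual registration may differ)

# Marker name -> one-line description, restating the table above.
MARKERS = {
    "unit": "pure-Python test; no external dependencies",
    "integration": "cross-module test; may touch the filesystem",
    "ollama": "requires a live Ollama server with at least one pulled model",
}


def pytest_configure(config):
    """Pytest hook: declare every marker so that it is a known marker."""
    for name, description in MARKERS.items():
        # "markers" is the ini key pytest reads marker declarations from.
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers declared, a test opts in with `@pytest.mark.ollama`, and `-m "not ollama"` deselects it at collection time.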
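To run every test locally, including the Ollama-marked ones, run pytest without the `not ollama` marker filter; a live Ollama server with at least one pulled model must be available.
To run only the default suite, with Ollama-marked tests deselected (the same set CI's `lint-and-test` job runs), pass the marker filter: `pytest -m "not ollama"`.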
Coverage by module group (v2.5.0)
Measured with pytest --cov=puma -m "not ollama" on commit
d5996f1 (Sprint 8 head). 354 tests passing, 7 deselected.
Aggregate global coverage: 61 % (3 159 statements, 1 241 missing).
| Module group | Statements | Missing | Coverage |
|---|---|---|---|
| `puma.perturbations.*` | 76 | 0 | 100 % |
| `puma.datasets.*` | 140 | 16 | 89 % |
| `puma.storage.*` | 192 | 39 | 80 % |
| `puma.preflight.*` | 302 | 70 | 77 % |
| `puma.metrics.*` | 242 | 58 | 76 % |
| `puma.runtime.*` | 164 | 48 | 71 % |
| `puma.orchestrator.*` | 346 | 125 | 64 % |
| `puma.cli` (single module) | 471 | 189 | 60 % |
| `puma.scenarios.*` | 280 | 127 | 55 % |
| `puma.sustainability.*` | 51 | 32 | 37 % |
| `puma.dashboard.*` | 683 | 431 | 37 % |
| `puma.reporting.*` | 87 | 87 | 0 % |
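The Coverage column follows from the other two as (statements − missing) / statements. As a quick sanity check, this short sketch recomputes it for a few rows; the helper name `coverage_pct` is illustrative, and the figures are copied from the table above.

```python
# (statements, missing) pairs copied from the table above.
GROUPS = {
    "puma.datasets.*": (140, 16),
    "puma.orchestrator.*": (346, 125),
    "puma.dashboard.*": (683, 431),
    "puma.reporting.*": (87, 87),
}


def coverage_pct(statements: int, missing: int) -> int:
    """Share of statements executed, as a whole-number percentage."""
    return round(100 * (statements - missing) / statements)


for name, (statements, missing) in GROUPS.items():
    print(f"{name:<22} {coverage_pct(statements, missing):>3} %")
```

The same function applied to the aggregate figures (3 159 statements, 1 241 missing) yields the global 61 %.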
Modules with 100 % coverage
- All `__init__.py` package files (trivially covered by imports).
- `puma.perturbations.*` (`text.py`, `gender_swap_prefix.py`, `register_shift.py`): fully covered. Each perturbation is a pure function with deterministic SHA-256 indexing and is exercised by both unit tests and the bias-evaluation sweep.
- `puma.preflight.catalog` and `puma.preflight.profile`: the catalog loader and the profile-selection logic are central to dispatch and are exhaustively tested in `tests/unit/test_catalog_metadata.py` and `tests/unit/test_preflight_profile.py`.
- `puma.orchestrator.runspec`: every RunSpec YAML field is covered by `tests/unit/test_runspec.py`.
- `puma.storage.models`: SQLAlchemy ORM definitions are covered by every test that opens the database.
- `puma.scenarios._reasoning`, `puma.scenarios.base`: the reasoning parser shim and the abstract scenario base class.
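To illustrate why a pure function with deterministic SHA-256 indexing is easy to cover fully, here is a minimal sketch of the general technique. It is not PUMA's implementation: `stable_index`, `PREFIXES`, and `apply_prefix` are hypothetical names chosen for the example.

```python
import hashlib

# Hypothetical variants; the real perturbation modules define their own data.
PREFIXES = ["As a user, ", "As an engineer, ", "As a reviewer, "]


def stable_index(text: str, n_choices: int) -> int:
    """Map text to an index in [0, n_choices) using its SHA-256 digest.

    Unlike the built-in hash(), the result does not vary between
    interpreter runs, so a test can assert on it directly.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_choices


def apply_prefix(text: str) -> str:
    """Pure function: the same input always receives the same prefix."""
    return PREFIXES[stable_index(text, len(PREFIXES))] + text
```

Because the output depends only on the input, a unit test needs no seeding or mocking to exercise every line.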
Modules with < 40 % coverage, and why
| Module | Coverage | Reason |
|---|---|---|
| `puma.reporting.report` | 0 % | Markdown/PDF report generation. Exercised manually during release prep; output is reviewed by eye. Adding unit tests is on the backlog but low priority, since failures here are visual and immediately obvious. |
| `puma.dashboard.components` | 26 % | Streamlit widget wrappers. Streamlit's runtime is not amenable to fine-grained unit testing; smoke tests in `tests/unit/test_dashboard_smoke.py` cover the rendering path end-to-end via `streamlit.testing.AppTest`. |
| `puma.dashboard.views.*` | 9–21 % | Same rationale as components. Each view is a `render()` function that pulls filtered data from session state and renders Plotly/Streamlit primitives. Smoke tests validate they import and render without crashing on representative data; deeper coverage would require a Streamlit harness we have chosen not to build. |
| `puma.orchestrator.compare` | 0 % | Inter-run comparison helper. Exercised via `puma compare` in manual workflows; CLI happy-path coverage is in `tests/cli/`. The Python function `compare_runs` is currently called only through the CLI shim. |
| `puma.preflight.report` | 34 % | Rich-formatted CLI report builder. Output is verified by reading the `puma preflight` output during dev. The branches that go untested are detector-specific formatting paths (NVIDIA vs CPU-only vs AMD), which are hard to exercise without varied hardware. |
| `puma.sustainability.codecarbon_wrapper` | 37 % | Wraps the CodeCarbon `EmissionsTracker`. The Sprint-2 D15 fix is covered by `tests/unit/test_emissions_d15.py` for the critical paths; remaining missing lines are vendor-init branches (cold-start `cpu_power_meter` etc.) that PUMA does not control. |
| `puma.dashboard.app` (router) | 81 % | Above the 40 % threshold and listed for completeness; uncovered lines are the dark-mode CSS branch and one fallback path used only on first-load with no runs in the DB. |
Modules in the 40–80 % band
These are the natural focus areas for incremental test additions, since they do not run into the Streamlit/Plotly testing barrier:
- `puma.scenarios.estimation_tawos` (45 %), `prioritization_jira` (64 %), `triage_jira` (53 %): scenario-level branches and edge cases (malformed Ollama responses, retry behaviour) are partially covered. v2.5.0 added the canonical estimation baseline; future Sprints may add fuzz tests of the parser.
- `puma.orchestrator.runner` (64 %): the happy path is covered; perturbation-cascade error paths and emissions integration are less so.
- `puma.cli` (60 %): Typer command handlers; happy-path coverage is thorough, but option-validation edge cases are only partially tested.
- `puma.runtime.client` (61 %), `puma.runtime.cache` (88 %, above the band and listed here with the client): Ollama client wrapper; some branches require a live Ollama and are covered by the `@pytest.mark.ollama` tests.
Tests excluded by default
| Exclusion mechanism | Count | Reason |
|---|---|---|
| `-m "not ollama"` in CI's `lint-and-test` job | 7 | Tests require a live Ollama server. Run instead by the `integration-tests-ollama` CI job (push to `main`/`develop` only) introduced in v2.5.0. |
No slow marker is currently in use; long-running smoke runs are in
tests/smoke/ and gated by being in a separate directory rather than
by a pytest marker.
What the global number means
61 % global coverage is the right floor for this codebase, not a
target to chase. The Streamlit dashboard accounts for 431 of the 1 241
missing statements (35 % of the gap). Excluding the dashboard, the
non-UI code is about 67 % covered: (3 159 − 683) = 2 476 statements
with (1 241 − 431) = 810 missing. Further excluding
puma.reporting.report at 0 % (87 statements, all missing) gives 70 %.
Future Sprint budgets are better spent on methodology (statistical
tests, new scenarios, hardware tiers) than on chasing the Streamlit
view percentages, which would require investing in a Streamlit
testing harness with limited defect-detection upside.
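The exclusion arithmetic can be checked directly from the figures in the module-group table. The sketch below is only a worked calculation; the helper names `exclude` and `pct_covered` are illustrative.

```python
# (statements, missing) pairs taken from the coverage table.
TOTAL = (3159, 1241)
DASHBOARD = (683, 431)   # puma.dashboard.*
REPORTING = (87, 87)     # puma.reporting.*


def exclude(total, *groups):
    """Subtract each group's statements and missing counts from the total."""
    statements, missing = total
    for group_statements, group_missing in groups:
        statements -= group_statements
        missing -= group_missing
    return statements, missing


def pct_covered(statements, missing):
    """Percentage of statements that were executed."""
    return 100 * (statements - missing) / statements


# Dashboard's share of all missing statements.
print(f"{100 * DASHBOARD[1] / TOTAL[1]:.1f}")                       # 34.7
# Coverage without the dashboard: 2 476 statements, 810 missing.
print(f"{pct_covered(*exclude(TOTAL, DASHBOARD)):.1f}")             # 67.3
# Coverage without dashboard and reporting: 2 389 statements, 723 missing.
print(f"{pct_covered(*exclude(TOTAL, DASHBOARD, REPORTING)):.1f}")  # 69.7
```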
If a regression slips through the existing suite, the right response is to add a targeted test for that specific failure mode, not to generally raise the coverage number.