Testing: structure, coverage, and execution

This document describes PUMA's test organisation and records the coverage breakdown by module group as of v2.5.0. It exists so that contributors can see where tests are dense and where gaps are acceptable, rather than guessing from a single global number.

Test layout

Directory	Purpose	Typical fixtures
`tests/unit/`	Pure-Python unit tests; no Ollama, no Docker volumes.	Module-local YAML, in-memory SQLite
`tests/integration/`	Cross-module integration. Sub-marked `@pytest.mark.ollama` when a live Ollama server is required.	`puma.db` fixture, real Ollama for marked tests
`tests/cli/`	One file per `puma <command>`; covers `--help` exposure, happy path, error paths.	Typer `CliRunner`, `monkeypatch` of `_run_baseline_for_validation` and similar helpers
`tests/smoke/`	End-to-end smoke runs; long-running, and kept apart by directory rather than by a dedicated pytest marker.	Real Ollama, small N, abbreviated specs

Markers

Marker	Meaning	Default behaviour
`@pytest.mark.unit`	Pure-Python; no external dependencies.	Always selected
`@pytest.mark.integration`	Cross-module; may touch the filesystem.	Always selected
`@pytest.mark.ollama`	Requires a live Ollama server with at least one pulled model.	Deselected by default (`-m "not ollama"` in CI's lint-and-test job). Run by the separate `integration-tests-ollama` CI job on push to `main`/`develop` (see `.github/workflows/lint-and-test.yml`, added in v2.5.0 — I8).
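
For `-m "not ollama"` to select on these markers without warnings, pytest expects them to be registered. The source does not say which mechanism PUMA uses (it may declare them in `pyproject.toml` or `pytest.ini`), so the following is only a minimal sketch of one option, a `conftest.py` hook. `pytest_configure` and `config.addinivalue_line` are standard pytest APIs; the `MARKERS` dictionary is an illustrative name, with descriptions taken from the table above.

```python
# conftest.py (sketch): register the three markers described above.
# Assumption: PUMA may declare these elsewhere; this is one possible mechanism.

MARKERS = {
    "unit": "pure-Python; no external dependencies",
    "integration": "cross-module; may touch the filesystem",
    "ollama": "requires a live Ollama server with at least one pulled model",
}


def pytest_configure(config):
    """Standard pytest hook, called once after command-line parsing."""
    for name, description in MARKERS.items():
        # Each registered line has the form "<marker name>: <description>".
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test that needs the server would then carry `@pytest.mark.ollama`, and `pytest -m "not ollama"` would deselect it.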

To run every test locally, including the Ollama-marked ones:

docker compose up -d puma_ollama puma_runner
docker exec puma_runner pytest tests/ -v

To run only the tests that do not need Ollama (the same set CI's lint-and-test job runs):

docker exec puma_runner pytest tests/ -m "not ollama" --cov=puma --cov-report=term-missing

Coverage by module group: v2.5.0

Measured with `pytest --cov=puma -m "not ollama"` on commit `d5996f1` (Sprint 8 head): 354 tests passing, 7 deselected.

Aggregate global coverage: 61 % (3 159 statements, 1 241 missing).

Module group	Statements	Missing	Coverage
`puma.perturbations.*`	76	0	100 %
`puma.datasets.*`	140	16	89 %
`puma.storage.*`	192	39	80 %
`puma.preflight.*`	302	70	77 %
`puma.metrics.*`	242	58	76 %
`puma.runtime.*`	164	48	71 %
`puma.orchestrator.*`	346	125	64 %
`puma.cli` (single module)	471	189	60 %
`puma.scenarios.*`	280	127	55 %
`puma.sustainability.*`	51	32	37 %
`puma.dashboard.*`	683	431	37 %
`puma.reporting.*`	87	87	0 %
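
The Coverage column follows from the other two as (statements − missing) / statements. A short sketch that recomputes it from the table; the names `GROUPS` and `coverage_pct` are illustrative only. It also shows that the listed groups sum to slightly less than the aggregate, so a small remainder is not itemised in this table (the source does not say which modules it comprises).

```python
# (statements, missing) per module group, copied from the table above.
GROUPS = {
    "puma.perturbations": (76, 0),
    "puma.datasets": (140, 16),
    "puma.storage": (192, 39),
    "puma.preflight": (302, 70),
    "puma.metrics": (242, 58),
    "puma.runtime": (164, 48),
    "puma.orchestrator": (346, 125),
    "puma.cli": (471, 189),
    "puma.scenarios": (280, 127),
    "puma.sustainability": (51, 32),
    "puma.dashboard": (683, 431),
    "puma.reporting": (87, 87),
}


def coverage_pct(statements, missing):
    """Percentage of statements executed, rounded to the nearest integer."""
    return round(100 * (statements - missing) / statements)


per_group = {name: coverage_pct(s, m) for name, (s, m) in GROUPS.items()}
aggregate = coverage_pct(3159, 1241)

# Statements and missing lines that the aggregate counts but the table omits.
unlisted_statements = 3159 - sum(s for s, _ in GROUPS.values())
unlisted_missing = 1241 - sum(m for _, m in GROUPS.values())

print(per_group["puma.dashboard"])            # 37
print(aggregate)                              # 61
print(unlisted_statements, unlisted_missing)  # 125 19
```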

Modules with 100 % coverage

All `__init__.py` package files (trivially covered by imports).
`puma.perturbations.*` (`text.py`, `gender_swap_prefix.py`, `register_shift.py`): fully covered. Each perturbation is a pure function with deterministic SHA-256 indexing and is exercised by both unit tests and the bias-evaluation sweep.
`puma.preflight.catalog` and `puma.preflight.profile`: the catalog loader and the profile-selection logic are central to dispatch and are exhaustively tested in `tests/unit/test_catalog_metadata.py` and `tests/unit/test_preflight_profile.py`.
`puma.orchestrator.runspec`: every RunSpec YAML field is covered by `tests/unit/test_runspec.py`.
`puma.storage.models`: SQLAlchemy ORM definitions are covered by every test that opens the database.
`puma.scenarios._reasoning`, `puma.scenarios.base`: the reasoning parser shim and the abstract scenario base class.
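
To illustrate what "deterministic SHA-256 indexing" in a pure perturbation function can mean, here is a small sketch. It is not PUMA's implementation: `pick_variant`, `perturb`, and the `PREFIXES` list are hypothetical, and the source does not say what exactly is hashed. The sketch only shows the general idea that reducing a SHA-256 digest of the input modulo the number of variants gives a choice that is stable across runs and processes, unlike Python's built-in `hash()`, which is salted per process for strings.

```python
import hashlib

# Hypothetical variants; PUMA's real perturbation tables are not shown here.
PREFIXES = ["As a user, ", "As an engineer, ", "As a manager, "]


def pick_variant(text: str, variants: list[str]) -> str:
    """Choose a variant deterministically from the SHA-256 digest of `text`."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a list index.
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]


def perturb(text: str) -> str:
    """Pure function: the same input always yields the same output."""
    return pick_variant(text, PREFIXES) + text
```

Because the function has no hidden state, a unit test can simply call it twice with the same input and compare the results.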

Modules with < 40 % coverage, and why

Module	Coverage	Reason
`puma.reporting.report`	0 %	Markdown/PDF report generation. Exercised manually during release prep; output is reviewed by eye. Adding unit tests is on the backlog but low priority — failures here are visual and immediately obvious.
`puma.dashboard.components`	26 %	Streamlit widget wrappers. Streamlit's runtime is not amenable to fine-grained unit testing; smoke tests in `tests/unit/test_dashboard_smoke.py` cover the rendering path end-to-end via `streamlit.testing.v1.AppTest`.
`puma.dashboard.views.*`	9–21 %	Same rationale as `components`. Each view is a `render()` function that pulls filtered data from session state and renders Plotly/Streamlit primitives. Smoke tests validate they import and render without crashing on representative data; deeper coverage would require a Streamlit harness we have chosen not to build.
`puma.orchestrator.compare`	0 %	Inter-run comparison helper. Exercised via `puma compare` in manual workflows; CLI happy-path coverage is in `tests/cli/`. The Python function `compare_runs` is currently called only through the CLI shim.
`puma.preflight.report`	34 %	Rich-formatted CLI report builder. Output is verified by reading the `puma preflight` output during dev. The branches that go untested are detector-specific formatting paths (NVIDIA vs CPU-only vs AMD) — hard to exercise without varied hardware.
`puma.sustainability.codecarbon_wrapper`	37 %	Wraps the CodeCarbon `EmissionsTracker`. The Sprint-2 D15 fix is covered by `tests/unit/test_emissions_d15.py` for the critical paths; remaining missing lines are vendor-init branches (cold-start `cpu_power_meter` etc.) that PUMA does not control.
`puma.dashboard.app` (router)	81 %	Above the 40 % threshold and listed for completeness; uncovered lines are the dark-mode CSS branch and one fallback path used only on first-load with no runs in the DB.

Modules in the 40–80 % band

These are the natural focus areas for incremental test additions that do not run into the Streamlit/Plotly testing barrier:

`puma.scenarios.estimation_tawos` (45 %), `prioritization_jira` (64 %), `triage_jira` (53 %): scenario-level branches and edge cases (malformed Ollama responses, retry behaviour) are partially covered. v2.5.0 added the canonical estimation baseline; future Sprints may add fuzz tests of the parser.
`puma.orchestrator.runner` (64 %): the happy path is covered; perturbation-cascade error paths and emissions integration are less well covered.
`puma.cli` (60 %): Typer command handlers. Happy-path coverage is thorough, but option-validation edge cases are only partially tested.
`puma.runtime.client` (61 %), `puma.runtime.cache` (88 %, above the band and listed here with its sibling module): Ollama client wrapper; some branches require a live Ollama and are covered by the `@pytest.mark.ollama` tests.

Tests excluded by default

Exclusion mechanism	Count	Reason
`-m "not ollama"` in CI's lint-and-test job	7	Tests require a live Ollama server. Run instead by the `integration-tests-ollama` CI job (push to main/develop only) introduced in v2.5.0.

No `slow` marker is currently in use; long-running smoke runs live in `tests/smoke/` and are separated by directory rather than by a pytest marker.

What the global number means

61 % global coverage is the right floor for this codebase, not a target to chase. The Streamlit dashboard accounts for 431 of the 1 241 missing statements (35 % of the gap). Excluding the dashboard, the non-UI code is around 67 % covered: 3 159 − 683 = 2 476 statements, of which 1 241 − 431 = 810 are missing. Further excluding `puma.reporting.report` (87 statements at 0 %) gives 70 %. Future Sprint budgets are better spent on methodology (statistical tests, new scenarios, hardware tiers) than on chasing the Streamlit view percentages, which would require investing in a Streamlit testing harness with limited defect-detection upside.
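
The exclusion arithmetic can be checked directly against the figures in the module-group table. A minimal sketch; the function name `coverage_excluding` is illustrative only.

```python
# (statements, missing) pairs taken from the module-group table.
TOTAL = (3159, 1241)     # aggregate
DASHBOARD = (683, 431)   # puma.dashboard.*
REPORTING = (87, 87)     # puma.reporting.*


def coverage_excluding(total, *excluded):
    """Coverage (%) of what remains after removing the excluded groups."""
    statements = total[0] - sum(s for s, _ in excluded)
    missing = total[1] - sum(m for _, m in excluded)
    return 100 * (statements - missing) / statements


# Share of all missing statements that sit in the dashboard.
dashboard_share = 100 * DASHBOARD[1] / TOTAL[1]
without_dashboard = coverage_excluding(TOTAL, DASHBOARD)
without_dashboard_or_reporting = coverage_excluding(TOTAL, DASHBOARD, REPORTING)

print(f"{dashboard_share:.1f}")                 # 34.7
print(f"{without_dashboard:.1f}")               # 67.3
print(f"{without_dashboard_or_reporting:.1f}")  # 69.7
```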

If a regression slips through the existing suite, the right response is to add a targeted test for that specific failure mode, not to generally raise the coverage number.
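
As an illustration of what a targeted test for a specific failure mode might look like, the sketch below pins down one invented bug: a parser that crashed on a model response containing no number. Both `parse_estimate` and the failure scenario are hypothetical and are not PUMA APIs; the point is only the shape of the test, which names the failure and guards the neighbouring happy path.

```python
import re
from typing import Optional


def parse_estimate(response: str) -> Optional[int]:
    """Hypothetical parser: extract the first integer from a model response.

    Returns None instead of raising when no number is present.
    """
    match = re.search(r"-?\d+", response)
    return int(match.group()) if match else None


def test_parse_estimate_returns_none_on_malformed_response():
    # Regression: a response with no digits used to raise (hypothetical bug).
    assert parse_estimate("I cannot estimate this issue.") is None


def test_parse_estimate_still_handles_well_formed_response():
    # Guard the happy path so the fix does not change normal behaviour.
    assert parse_estimate("Estimate: 5 story points") == 5
```

A test like this raises coverage only incidentally; its value is that the same failure cannot return unnoticed.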