PUMA sustainability tracking

PUMA records the energy use and estimated carbon emissions of every benchmark run. This document explains why it does so, how the measurement works, what the numbers mean, and, just as important, what they do not mean.

Why measure carbon

Running language models, even small local ones, consumes energy, and energy has a carbon cost that varies by hardware and electrical grid. A benchmark that reports only task accuracy hides half the trade-off: a model that is marginally more accurate but several times more energy-hungry is not obviously the better choice for a project-management office deciding what to deploy. Reporting energy and carbon alongside F1 and MAE lets that trade-off be made explicitly. The case for treating compute cost as a first-class metric in NLP evaluation was made by Strubell et al. (2019); PUMA applies that principle at inference time for PMO-oriented tasks.

How PUMA measures

Tracker. PUMA uses CodeCarbon to measure energy and estimate CO₂. The tracker version is captured per run (codecarbon_version, e.g. 3.2.7 in the reference configuration). The integration lives in src/puma/orchestrator/runner.py and src/puma/sustainability/codecarbon_wrapper.py; the per-run figures are persisted in the emissions table (src/puma/storage/models.py, the Emission model) and surfaced in a submission's sustainability block (src/puma/community/schema.py).

Tracking mode. The tracker runs in tracking_mode = "machine". In PUMA's split-container architecture the model runs in a separate Ollama container, so whole-machine accounting attributes the GPU/CPU/RAM energy of the inference host correctly; a process-scoped mode would miss the GPU work entirely (this was the substance of the earlier GPU-attribution work).

Coverage. In the current PUMA release, the canonical baseline specs ship with sustainability.codecarbon: true, so every canonical run produces an emissions record by default. Attribution is per-run, not per-instance: one emissions row summarises the whole run.

Country accounting. country_iso defaults to ESP (Spain) for reproducibility of the reference figures. It selects the grid carbon-intensity factor CodeCarbon applies to convert energy into CO₂. Users running PUMA elsewhere should override it to match their own grid; the energy figure (energy_kwh_total) is grid-independent, only the CO₂ conversion changes.
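The conversion described above amounts to a single multiplication, sketched below. The function name and both intensity values are illustrative placeholders, not CodeCarbon's API and not real grid factors; the point is only that the same measured energy yields different CO₂ figures under different grid factors.

```python
def co2_grams(energy_kwh: float, grid_g_per_kwh: float) -> float:
    """Convert measured energy to grams of CO2-eq using an average grid factor."""
    if energy_kwh < 0 or grid_g_per_kwh < 0:
        raise ValueError("energy and grid intensity must be non-negative")
    return energy_kwh * grid_g_per_kwh


# Placeholder intensities in g CO2-eq per kWh. These are NOT real grid factors.
HYPOTHETICAL_GRID_A = 200.0
HYPOTHETICAL_GRID_B = 400.0

# The energy measurement is the same in both cases; only the factor differs.
energy_kwh_total = 0.001
co2_a = co2_grams(energy_kwh_total, HYPOTHETICAL_GRID_A)
co2_b = co2_grams(energy_kwh_total, HYPOTHETICAL_GRID_B)
print(f"grid A: {co2_a:.2f} g, grid B: {co2_b:.2f} g")
```

When comparing runs from different locations, energy_kwh_total is therefore the safer figure to compare; co2_grams_total is only comparable between runs that share a country_iso.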

Hardware coverage. PUMA defines five baseline hardware profiles in config/profiles.yaml: cpu-lite, cpu-standard, gpu-entry, gpu-mid, and gpu-high (plus Apple-Silicon variants, not covered here). All five were verified structurally in the current PUMA release: each loads, declares no per-profile codecarbon override (the settings are global, so the measurement configuration is identical across profiles), and is accepted by the runner when requested via a spec. Empirical measurement was performed only on the host's auto-detected profile, gpu-entry; live measurement on the other profiles awaits access to that hardware.

What the numbers mean

A PUMA emissions record contains, in plain language:

energy_kwh_total — total electrical energy the inference host drew during the run, in kilowatt-hours (sum of CPU, GPU, and RAM energy).
co2_grams_total — that energy converted to grams of CO₂-equivalent using the configured country's grid intensity.
duration_s — wall-clock seconds the tracker was active.
cpu_energy / gpu_energy / ram_energy — the per-component breakdown; for GPU-bound runs gpu_energy dominates.
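As a minimal sketch of how these fields relate, the record can be modelled as a small dataclass with a consistency check. The class below is a hypothetical stand-in, not PUMA's Emission model; it assumes the per-component fields use the same unit as the total (kWh), which the field list does not state, and the sample values are made up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EmissionRecord:
    """Hypothetical stand-in for one per-run emissions row (not PUMA's Emission model)."""

    energy_kwh_total: float
    co2_grams_total: float
    duration_s: float
    cpu_energy: float  # assumed to be in kWh, like the total
    gpu_energy: float  # assumed to be in kWh
    ram_energy: float  # assumed to be in kWh

    def components_match_total(self, rel_tol: float = 1e-6) -> bool:
        """True if CPU + GPU + RAM energy adds up to the reported total."""
        parts = self.cpu_energy + self.gpu_energy + self.ram_energy
        return math.isclose(parts, self.energy_kwh_total, rel_tol=rel_tol)

    def gpu_share(self) -> float:
        """Fraction of total energy attributed to the GPU (0.0 if the total is zero)."""
        if self.energy_kwh_total == 0:
            return 0.0
        return self.gpu_energy / self.energy_kwh_total


# Made-up numbers chosen only to exercise the checks; not measured values.
record = EmissionRecord(
    energy_kwh_total=0.0010,
    co2_grams_total=0.17,
    duration_s=50.0,
    cpu_energy=0.0002,
    gpu_energy=0.0007,
    ram_energy=0.0001,
)
print(record.components_match_total(), round(record.gpu_share(), 2))
```

A check of this kind is a cheap way to catch a record whose breakdown and total have drifted apart, for example after a unit change in one field.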

Order of magnitude. On the reference gpu-entry host, a 200-instance canonical run observed:

Run	`energy_kwh_total`	`co2_grams_total`	`duration_s`
`baseline_triage` (qwen2.5:3b)	~0.00065 kWh	~0.11 g	~43 s
`baseline_estimation_canonical` (qwen2.5:3b)	~0.0012 kWh	~0.21 g	~54 s

These values are specific to this configuration; a different model, host, run size, or country will produce different numbers. The useful takeaway is the scale: a canonical PUMA baseline on a small local model costs a fraction of a gram of CO₂ and about a thousandth of a kWh (roughly one watt-hour), on the order of a laptop charger drawing power for a minute. As a coarse sanity reference, a single kWh on a typical European grid corresponds to a few hundred grams of CO₂-eq; the observed ratio (about 0.1 to 0.2 g of CO₂ for about 0.65 to 1.2 Wh, or roughly 170 g per kWh) is of that order, which is one way to tell that the tracker is producing sane output rather than a stub.
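The sanity check described above can be reproduced from the rounded table values. The sketch below derives two quantities per run: the grid intensity implied by the CO₂/energy ratio, and the average power implied by energy over duration. The helper names are illustrative, and because the inputs are rounded ("~") figures, the results are approximate.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 MJ

# Rounded reference figures copied from the table above.
runs = {
    "baseline_triage": {"energy_kwh": 0.00065, "co2_g": 0.11, "duration_s": 43},
    "baseline_estimation_canonical": {"energy_kwh": 0.0012, "co2_g": 0.21, "duration_s": 54},
}


def implied_intensity_g_per_kwh(energy_kwh: float, co2_g: float) -> float:
    """Grid factor implied by a record: grams of CO2-eq per kWh."""
    return co2_g / energy_kwh


def average_power_w(energy_kwh: float, duration_s: float) -> float:
    """Mean power draw over the run, in watts (joules per second)."""
    return energy_kwh * JOULES_PER_KWH / duration_s


summary = {}
for name, r in runs.items():
    intensity = implied_intensity_g_per_kwh(r["energy_kwh"], r["co2_g"])
    power = average_power_w(r["energy_kwh"], r["duration_s"])
    summary[name] = (intensity, power)
    print(f"{name}: {intensity:.0f} g/kWh, {power:.0f} W average")
```

Both runs imply an intensity of roughly 170 g CO₂-eq per kWh and an average whole-machine draw in the tens of watts, which is the kind of agreement one would expect if the same grid factor was applied to both and the energy readings are real measurements rather than placeholders.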

Limitations

PUMA measures a slice of the footprint, not the whole world. Specifically:

Single-machine scope. The figures cover the inference host only. They exclude data-center overhead (PUE), networking, and the amortised cost of training the model, any of which can be large relative to the per-run inference figures reported here.
Country code is an approximation. country_iso selects an average grid factor, not a real-time, location-specific marginal intensity.
Cache and warm-up effects. The first run after a cold start has different energy and latency than later runs. PUMA's KV-cache-hygiene protocol (restart Ollama before each canonical baseline) exists to make runs comparable across specs, not to minimise energy; reported energy is therefore representative of a freshly restarted server, not a steady-state one.
Estimation method. CodeCarbon reads CPU energy via RAPL on supported Linux and falls back to a model-based estimate elsewhere; on hosts where RAPL is restricted or absent the CPU term may be under-attributed.
Self-attestation. Emissions values in a published submission are reported by the runner that produced them. The community Verifier checks the integrity hash of the predictions, not the accuracy of the declared emissions — there is no independent re-measurement of energy.

How to reproduce

The canonical specs that produce the reference figures are specs/runs/baseline_triage.yaml and specs/runs/baseline_estimation_canonical.yaml; both enable codecarbon by default in the current PUMA release. Re-run and validate them with:

puma validate-baseline --spec specs/runs/baseline_triage.yaml --expected-f1 0.5867
puma validate-baseline --spec specs/runs/baseline_estimation_canonical.yaml --expected-mae 5.7150

Each run writes an emissions row to the emissions table; puma share-results --dry-run surfaces it in the submission's sustainability block.