PUMA User Guide¶

This guide walks through the complete workflow: provisioning → running benchmarks → comparing results → generating reports → exploring the dashboard.

Table of Contents¶

Installation and provisioning
Hardware preflight
Managing models
Managing datasets
Writing a run-spec
Running a benchmark
Comparing runs
Generating reports
Using the dashboard
Managing the database
Managing the inference cache
Advanced: perturbations
Advanced: multiple strategies
Advanced: sustainability tracking

1. Installation and Provisioning¶

Prerequisites¶

Docker Engine 24+ and Docker Compose v2 installed on the host.
At least 8 GB RAM (16 GB recommended for 3B parameter models).
Internet access for initial model and dataset download.

No Python installation is needed on the host.

One-shot provisioning¶

git clone <repo-url>
cd puma
./start_puma.sh

start_puma.sh performs six steps automatically:

Step	Action
1	Verify Docker and Docker Compose are available
2	Build the `puma_runner` Docker image
3	Run `puma preflight` to detect hardware and select a profile
4	Start `puma_ollama` and `puma_dashboard` services
5	Pull the two smallest models for the detected profile
6	Verify / download datasets and apply the database schema

Flags:

./start_puma.sh --profile cpu-standard   # override hardware detection
./start_puma.sh --skip-models            # skip model download (models already pulled)
./start_puma.sh --skip-datasets          # skip dataset verification
./start_puma.sh --smoke-only             # run a dry-run smoke test after provisioning
./start_puma.sh --observability          # start optional Grafana overlay
./start_puma.sh --verbose                # print every shell command (set -x)

Accessing the CLI¶

All puma commands run inside the puma_runner container:

docker compose run --rm puma_runner puma <command> [options]

For convenience, create an alias:

alias puma='docker compose run --rm puma_runner puma'

2. Hardware Preflight¶

Before running benchmarks, PUMA detects your hardware and selects a compatible execution profile.

puma preflight

Example output:

Hardware capabilities
  CPU:    Intel Core i7-12700K
  RAM:    32.0 GB
  GPU:    NVIDIA RTX 3080 (10.0 GB VRAM)
  Ollama: 0.12.15

Selected profile: gpu-entry

Provisioning checks
  [OK]  Ollama is running
  [OK]  VRAM sufficient for qwen2.5:7b (4.7 GB < 10.0 GB)
  [OK]  Disk space: 87.2 GB free
  [WARN] Ollama version 0.12.15 — logprobs supported

Profile written to config/runtime_profile.yaml

Override the detected profile:

puma preflight --profile cpu-standard

Skip writing the config file:

puma preflight --no-write-config

Profiles and their requirements:

Profile	RAM	VRAM	Notes
`cpu-lite`	8 GB	—	Tiny models (≤1.5B) only
`cpu-standard`	16 GB	—	Models up to 7B on CPU
`gpu-entry`	16 GB	4 GB	7B models on GPU
`gpu-mid`	32 GB	8 GB	14B models
`gpu-high`	64 GB	16 GB+	32B+ models

3. Managing Models¶

PUMA's CLI never pulls or mutates models — pulling is delegated to the Ollama CLI for a single source of truth. The puma models sub-group is read-only and exposes three commands:

Command	What it shows	Source
`puma models list`	Tags pulled locally in Ollama	Ollama `/api/tags`
`puma models show <name>`	Details for one locally-pulled model	Ollama `/api/show`
`puma models recommended`	Curated PUMA catalog with local availability	`config/models_catalog.yaml` + `/api/tags`

See what is installed locally¶

puma models list

Shows every tag Ollama already has on disk.

See the curated catalog (with availability)¶

puma models recommended

Lists the PUMA-validated models from config/models_catalog.yaml, annotated with whether each one is already available locally. Use it to decide which model to fetch next with ollama pull.

Pull a model¶

Pulling goes through the Ollama CLI (inside the puma_ollama container for the Compose flow, so the model lands in the shared ollama_models volume):

docker compose exec puma_ollama ollama pull qwen2.5:3b

Or directly from the host if Ollama is installed there:

ollama pull qwen2.5:3b

puma doctor also suggests ollama pull <name> for any catalog model it cannot find locally, so the health check surfaces the same guidance.

4. Managing Datasets¶

Verify datasets¶

puma datasets verify

Checks file existence, row counts, and checksums for:

data/jira_balanced_200.csv: 200 Jira issues, 50 per priority class
data/tawos_clean.csv: 9 020 TAWOS agile backlog items

Example output:

============================================================
PUMA Dataset Verification
============================================================
[OK] jira_balanced_200.csv — 200 rows, 4 classes, hash OK
[OK] tawos_clean.csv — 9020 rows, 10 SP values, hash OK
============================================================
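The three checks named above (file existence, row count, checksum) can be reproduced by hand. The following is a minimal sketch, not PUMA's implementation: `verify_csv` is a hypothetical helper, the expected values are supplied by the caller, and SHA-256 is an assumption because this guide does not say which hash the command uses.

```python
import csv
import hashlib
from pathlib import Path


def verify_csv(path, expected_rows, expected_sha256=None):
    """Return a list of problems found; an empty list means the file passed.

    Hypothetical helper. SHA-256 is an assumption: the guide does not state
    which hash `puma datasets verify` compares against.
    """
    path = Path(path)
    if not path.is_file():
        return [f"{path.name}: file not found"]

    problems = []
    # Count data rows; DictReader consumes the header line as field names.
    with path.open(newline="", encoding="utf-8") as handle:
        rows = sum(1 for _ in csv.DictReader(handle))
    if rows != expected_rows:
        problems.append(f"{path.name}: expected {expected_rows} rows, found {rows}")

    # Compare the checksum only when the caller provides one.
    if expected_sha256 is not None:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != expected_sha256:
            problems.append(f"{path.name}: checksum mismatch")
    return problems
```

For example, `verify_csv("data/jira_balanced_200.csv", expected_rows=200)` returns an empty list when the file exists and has 200 data rows.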

If datasets are missing, download them:

docker compose run --rm puma_runner python scripts/prepare_datasets.py

5. Writing a Run-Spec¶

A run-spec is a YAML file that fully defines a benchmark. Place it in specs/runs/.

Minimal example¶

id: quick_triage
description: "Quick zero-shot triage test"
scenario: triage_jira
sample_size: 10
models:
  - qwen2.5:3b
adaptation:
  strategy: [zero-shot]
inference:
  temperature: 0.0
  seed: 42
metrics:
  - f1_macro

Full reference¶

id: full_benchmark_v1              # unique identifier (used in run_id and results path)
description: "Full benchmark"      # human-readable description

scenario: triage_jira              # triage_jira | estimation_tawos | prioritization_jira
sample_size: 50                    # number of dataset instances (1–10000)

models:                            # one or more Ollama model tags
  - qwen2.5:3b
  - qwen2.5:1.5b

adaptation:
  strategy:                        # one or more strategy IDs
    - zero-shot
    - few-shot-3
    - cot-few-shot

inference:
  temperature: 0.0                 # 0.0 for greedy; > 0 required for self-consistency
  seed: 42                         # fixed for reproducibility
  max_tokens: 256                  # maximum tokens in model response
  logprobs: false                  # set true to enable calibration metrics (Ollama ≥0.12.11)
  top_logprobs: 0                  # number of top logprob tokens to return

perturbations:                     # optional: list of text perturbations
  - typos_5pct                     # 5% character substitution
  - case_upper                     # uppercase entire text
  # - case_lower
  # - truncate_50pct
  # - tech_noise

metrics:                           # metrics to compute and store
  - f1_macro
  # - mae (for estimation_tawos)
  # - accuracy

sustainability:
  codecarbon: false                # set true to track CO₂ emissions via CodeCarbon
  country_iso: ESP                 # ISO 3166-1 alpha-3 for carbon intensity lookup

repeat: 1                          # number of independent repeats (for stability metrics)
profile_required: null             # null = use active profile; or force a specific one

Validation rules¶

scenario must be one of the three registered scenarios.
models must have at least one entry.
sample_size must be between 1 and 10 000.
self-consistency strategy requires temperature > 0.
Each entry in perturbations adds one extra prediction per instance, on top of the unperturbed baseline.
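The first four rules can be checked before a spec ever reaches the runner. The sketch below is illustrative only: `validate_runspec` is a hypothetical function, not part of the PUMA CLI, and it takes an already-parsed dict so that it needs nothing beyond the standard library.

```python
# Scenario names as listed in the full reference above.
VALID_SCENARIOS = {"triage_jira", "estimation_tawos", "prioritization_jira"}


def validate_runspec(spec):
    """Return a list of rule violations for a run-spec loaded as a dict.

    Hypothetical helper that mirrors the documented rules; PUMA's own
    validator may check more fields or word its errors differently.
    """
    errors = []
    if spec.get("scenario") not in VALID_SCENARIOS:
        errors.append("scenario must be one of: " + ", ".join(sorted(VALID_SCENARIOS)))
    if not spec.get("models"):
        errors.append("models must have at least one entry")

    sample_size = spec.get("sample_size")
    if not isinstance(sample_size, int) or not 1 <= sample_size <= 10_000:
        errors.append("sample_size must be between 1 and 10000")

    strategies = spec.get("adaptation", {}).get("strategy", [])
    temperature = spec.get("inference", {}).get("temperature", 0.0)
    if "self-consistency" in strategies and not temperature > 0:
        errors.append("self-consistency requires temperature > 0")
    return errors
```

An empty list means the spec satisfies these four rules. A dry run remains the authoritative check.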

6. Running a Benchmark¶

Dry run (no Ollama required)¶

A dry run builds all prompts, runs the full persistence pipeline, and writes artifacts — but returns [dry-run] instead of calling Ollama.

puma run specs/runs/smoke_triage.yaml --dry-run

Use this to validate your run-spec, test the DB schema, and check prompt templates.

Live run¶

puma run specs/runs/smoke_triage.yaml

The runner shows a Rich progress bar and structured logs:

2026-04-18 12:00:00 [info] run.start  run_id=smoke_triage_v1__abc123__20260418T120000
  smoke_triage_v1__abc123__20260418T120000 ━━━━━━━━━━━━ 20/20 0:01:23

Run complete: smoke_triage_v1__abc123__20260418T120000
Predictions: 40
  f1_macro: 0.6218
  accuracy: 0.6500
  parse_failure_rate: 0.0500

Run artifacts¶

After a run, results are stored in two places:

File system (results/<run_id>/):

results/smoke_triage_v1__abc123__20260418T120000/
├── runspec.yaml     # frozen copy of the run-spec used
├── metrics.json     # computed metrics as JSON
└── report.md        # generated after puma report (or --report flag)

Database (data/puma.db):

runs: run record with status and timestamps
predictions: one row per model × strategy × instance × perturbation
metrics: flat metric values for dashboard and comparison
profile_snapshots: hardware state at run time

Custom Ollama host¶

puma run spec.yaml --ollama-host http://192.168.1.10:11434

The OLLAMA_HOST environment variable is also respected.

7. Comparing Runs¶

Compare two runs¶

puma compare run_id_1 run_id_2

Output:

| Metric              | run_id_1 | run_id_2 |
|---------------------|----------|----------|
| accuracy            | 0.6500   | 0.7100   |
| f1_macro            | 0.6218   | 0.6891   |
| parse_failure_rate  | 0.0500   | 0.0250   |

Differences (run2 - run1):
  accuracy: +0.0600
  f1_macro: +0.0673
  parse_failure_rate: -0.0250
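The differences are plain subtraction, second run minus first. As a sketch of that arithmetic (the `metric_diffs` helper is hypothetical; the values are the ones from the table above):

```python
def metric_diffs(run1, run2, digits=4):
    """Return run2 - run1 for every metric present in both runs."""
    shared = sorted(set(run1) & set(run2))
    return {name: round(run2[name] - run1[name], digits) for name in shared}


# Metric values copied from the comparison table above.
run1 = {"accuracy": 0.6500, "f1_macro": 0.6218, "parse_failure_rate": 0.0500}
run2 = {"accuracy": 0.7100, "f1_macro": 0.6891, "parse_failure_rate": 0.0250}

for name, delta in metric_diffs(run1, run2).items():
    print(f"{name}: {delta:+.4f}")
```

A positive difference is an improvement for accuracy and f1_macro. For parse_failure_rate, lower is better, so a negative difference is the improvement.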

Compare three or more runs¶

puma compare run_a run_b run_c

Produces the comparison table without the diffs section.

Save comparison to file¶

puma compare run_id_1 run_id_2 --output comparison.json

8. Generating Reports¶

puma report <run_id>

The report is written to results/<run_id>/report.md and contains:

Executive summary — scenario, models, strategies, sample size, timestamps
Metrics table — all metrics computed for the run
Per-model breakdown — predictions and parse failures per model (if multiple models)
Robustness section — perturbation coverage (if perturbations were used)
Sustainability section — CO₂ and energy data (if CodeCarbon was enabled)
Latency section — p50/p95/p99 latency in ms
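To sanity-check the latency figures against raw measurements, a percentile can be computed by hand. The sketch below uses the nearest-rank definition, which is an assumption: the guide does not say which percentile method the report uses, and interpolating methods give slightly different numbers. The latency values are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds, not taken from a real run.
latencies_ms = [120, 135, 150, 180, 210, 240, 300, 420, 650, 1900]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

With only ten samples, p95 and p99 both land on the slowest request, so tail percentiles are more informative on larger sample sizes.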

Convert to PDF (requires Pandoc)¶

puma report <run_id> --format pdf

If pandoc and xelatex are available inside the container, a report.pdf is created alongside report.md. If either is missing, the command produces only the Markdown file and does not report an error.

Custom database path¶

puma report <run_id> --db /path/to/custom.db

9. Using the Dashboard¶

The dashboard is a read-only Streamlit application that reads directly from data/puma.db.

Start the dashboard service¶

docker compose up -d puma_dashboard
open http://localhost:8501   # macOS; on Linux use xdg-open

Or launch from the CLI:

puma dashboard
puma dashboard --port 8502 --host 127.0.0.1

Dashboard views¶

Overview¶

Shows a card for each run. Each card displays:

Run ID and status
F1 macro (or accuracy for other scenarios)
Parse failure rate

Use the Runs multiselect in the sidebar to filter which runs appear.

Model Comparison¶

Renders an interactive heatmap: rows are runs, columns are metrics. Color scale is green (high) → red (low) via RdYlGn.

Below the heatmap, a raw metrics table is shown with a download option (PNG).

Reliability¶

Plots reliability diagrams (calibration curves) comparing model confidence to actual accuracy per bin. Requires logprobs: true in the run-spec and Ollama ≥ 0.12.11. Falls back to synthetic data if no logprob data is available.
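A reliability diagram plots, per confidence bin, the mean confidence against the observed accuracy. The following sketch shows the binning step only, assuming equal-width bins and one confidence value per prediction. `reliability_bins` is a hypothetical helper, and the dashboard's bin count and binning scheme may differ.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one (mean_confidence, accuracy, count) tuple per non-empty bin,
    ordered from the lowest to the highest confidence bin.
    """
    totals = [[0.0, 0, 0] for _ in range(n_bins)]  # confidence sum, hits, count
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        totals[index][0] += conf
        totals[index][1] += int(hit)
        totals[index][2] += 1
    return [
        (conf_sum / count, hits / count, count)
        for conf_sum, hits, count in totals
        if count
    ]
```

A well-calibrated model produces bins where accuracy is close to mean confidence, that is, points near the diagonal of the diagram.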

Robustness¶

For runs with perturbations, shows a bar chart of prediction consistency rates per perturbation type. A consistency rate of 1.0 means the model gives the same answer whether or not the text was perturbed.
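The consistency rate described above can be sketched as the share of instances whose perturbed prediction matches the unperturbed one. The tuple layout and the `consistency_rates` name are assumptions for illustration; the real predictions table has more columns.

```python
from collections import defaultdict


def consistency_rates(predictions):
    """Per perturbation, the share of instances whose label matches the baseline.

    `predictions` is a list of (instance_id, perturbation, label) tuples,
    where perturbation is None for the unperturbed baseline.
    """
    baseline = {inst: label for inst, pert, label in predictions if pert is None}
    same = defaultdict(int)
    total = defaultdict(int)
    for inst, pert, label in predictions:
        if pert is None or inst not in baseline:
            continue  # skip baselines and instances without a baseline
        total[pert] += 1
        same[pert] += int(label == baseline[inst])
    return {pert: same[pert] / total[pert] for pert in total}
```

Note that this measures agreement with the model's own baseline answer, not with the gold label, so a model can be consistent and still wrong.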

Fairness¶

Breaks down accuracy by model. Displays the fairness gap (max − min accuracy across models). When group attributes are available in predictions, per-group metrics are shown.
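The fairness gap is a single subtraction. A minimal sketch, with a hypothetical function name and illustrative accuracies that are not measured results:

```python
def fairness_gap(accuracy_by_model):
    """Return max - min accuracy across models (0.0 when there is one model)."""
    if not accuracy_by_model:
        raise ValueError("need at least one model")
    values = list(accuracy_by_model.values())
    return max(values) - min(values)


# Illustrative numbers only.
gap = fairness_gap({"qwen2.5:3b": 0.71, "qwen2.5:1.5b": 0.65})
print(f"fairness gap: {gap:.4f}")
```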

Sustainability Frontier¶

A Pareto scatter plot of F1 macro (y-axis) vs latency proxy (x-axis). Each point is a run. The ideal model is top-left (high quality, low latency / energy).
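The runs worth a closer look are those on the Pareto front: no other run is both at least as fast and at least as accurate. The sketch below shows one way to compute that set. `pareto_front` and the tuple layout are assumptions, and the dashboard may derive its latency proxy differently.

```python
def pareto_front(runs):
    """Return the IDs of runs not dominated by any other run.

    `runs` is a list of (run_id, latency_ms, f1_macro). A run is dominated when
    another run is at least as fast and at least as accurate, and strictly
    better on one of the two.
    """
    front = []
    for run_id, latency, f1 in runs:
        dominated = any(
            other_latency <= latency
            and other_f1 >= f1
            and (other_latency < latency or other_f1 > f1)
            for _, other_latency, other_f1 in runs
        )
        if not dominated:
            front.append(run_id)
    return front
```

Runs off the front are strictly worse trade-offs: some other run matches or beats them on both axes.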

Instance Drill-down¶

Select any run and any instance to inspect:

Gold label (ground truth)
Parsed label (what PUMA extracted from the response)
Raw LLM response text
Latency in ms, token counts (in/out)
Prompt hash (for cache lookup)

Filters that apply across dashboard views:

Filter	Effect
Runs	Restrict all views to selected run IDs
Date range	Filter runs by start date
Models	Filter predictions by model name

10. Managing the Database¶

Apply schema (first time or after updates)¶

puma db migrate

Lists all created tables:

Schema applied to data/puma.db
  table: emissions
  table: instances
  table: metrics
  table: predictions
  table: profile_snapshots
  table: runs

Check database size¶

puma db status

data/puma.db: 1.4 MB

Direct SQL inspection¶

docker compose run --rm puma_runner python3 -c "
import sqlite3
# Print the five most recently started runs
conn = sqlite3.connect('data/puma.db')
for (run_id,) in conn.execute(\"SELECT run_id FROM runs ORDER BY started_at DESC LIMIT 5\"):
    print(run_id)
conn.close()
"

11. Managing the Inference Cache¶

PUMA caches Ollama responses by prompt hash in data/cache/inferences.db. Cached responses are returned instantly on repeated runs with identical prompts, seeds, and models.
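The idea of keying a cache on a prompt hash can be illustrated as follows. This is a sketch under stated assumptions, not PUMA's actual scheme: the guide does not document which fields go into the key or which hash is used. Model, prompt, and seed are taken from the sentence above, and SHA-256 is a placeholder choice.

```python
import hashlib
import json


def cache_key(model, prompt, seed):
    """Derive a deterministic key from the fields that identify an inference.

    Assumed key composition; the real cache may hash other or additional fields.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "seed": seed},
        sort_keys=True,       # stable field order, so equal inputs hash equally
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to one of the hashed fields yields a different key and therefore a cache miss. A setting that is not part of the key would keep returning the old cached response until the cache is cleared.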

Show cache statistics¶

puma cache stats

Inference cache: 342 entries, 128.4 KB

Clear the cache¶

puma cache clear

Use this when you want to force fresh inference (e.g., after updating a model or changing temperature).

12. Advanced: Perturbations¶

Perturbations test model robustness by applying text transformations to the input and comparing predictions against the unperturbed baseline.

Available perturbations¶

ID	Effect
`typos_5pct`	Replace 5% of characters with homoglyphs (a→а, o→0, …)
`case_upper`	Convert all text to UPPERCASE
`case_lower`	Convert all text to lowercase
`truncate_50pct`	Keep only the first 50% of the text
`tech_noise`	Insert random technical jargon tokens
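The three deterministic perturbations in the table are simple string transformations. The sketch below illustrates them. It assumes truncation is measured in characters (the guide does not say whether it counts characters, words, or tokens), and the function and registry names are hypothetical. The two randomized perturbations, typos_5pct and tech_noise, are left out because their sampling details are not documented here.

```python
def case_upper(text):
    return text.upper()


def case_lower(text):
    return text.lower()


def truncate_50pct(text):
    """Keep the first half of the text, measured in characters (an assumption)."""
    return text[: len(text) // 2]


# Hypothetical registry mapping perturbation IDs to functions.
PERTURBATIONS = {
    "case_upper": case_upper,
    "case_lower": case_lower,
    "truncate_50pct": truncate_50pct,
}


def perturb(text, names):
    """Return {perturbation_id: text}, with None as the key of the unperturbed baseline."""
    variants = {None: text}
    for name in names:
        variants[name] = PERTURBATIONS[name](text)
    return variants
```

For example, `perturb("Login fails", ["case_upper", "truncate_50pct"])` yields the original text plus "LOGIN FAILS" and "Login".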

How it works¶

For each instance, PUMA produces 1 + len(perturbations) predictions:

One with the original text (perturbation = null)
One per perturbation (perturbation = <name>)

All predictions are stored in the predictions table and the Robustness view in the dashboard shows consistency rates.

Example run-spec with perturbations¶

id: robustness_test
scenario: triage_jira
sample_size: 30
models: [qwen2.5:3b]
adaptation:
  strategy: [zero-shot]
inference:
  temperature: 0.0
  seed: 42
perturbations:
  - typos_5pct
  - case_upper
  - truncate_50pct
metrics: [f1_macro]

This generates 30 × 4 = 120 predictions (original + 3 perturbations).

13. Advanced: Multiple Strategies¶

Running multiple strategies in a single spec is efficient — the dataset is sampled once and each strategy builds its own prompts from the same rows.

id: strategy_comparison
scenario: triage_jira
sample_size: 50
models: [qwen2.5:3b]
adaptation:
  strategy:
    - zero-shot
    - few-shot-3
    - rcoif
    - contextual-anchoring
inference:
  temperature: 0.0
  seed: 42
metrics: [f1_macro, accuracy]

This produces 50 × 4 = 200 predictions. After the run, use puma compare to see which strategy performs best.
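The prediction counts quoted in this guide follow from multiplying the spec's dimensions. The helper below is a hypothetical sketch of that arithmetic, generalized from the "one row per model × strategy × instance × perturbation" description. Strategies that sample the model several times, such as self-consistency, may store rows differently.

```python
def expected_predictions(sample_size, n_models, n_strategies, n_perturbations=0):
    """Estimate stored predictions: every instance is run once per model and
    strategy, for the original text plus each perturbation."""
    return sample_size * n_models * n_strategies * (1 + n_perturbations)


# The strategy comparison above: 50 instances, 1 model, 4 strategies.
print(expected_predictions(50, n_models=1, n_strategies=4))
# The perturbation example: 30 instances, 1 model, 1 strategy, 3 perturbations.
print(expected_predictions(30, n_models=1, n_strategies=1, n_perturbations=3))
```

Use it to estimate run time before launching a large spec, since each prediction is one Ollama call unless it is served from the cache.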

Self-consistency (majority vote)¶

Self-consistency samples the model multiple times and takes the majority vote. It requires temperature > 0:

adaptation:
  strategy: [self-consistency]
inference:
  temperature: 0.7
  seed: 42
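The voting step itself is small. A minimal sketch, assuming the sampled responses have already been parsed into labels; `majority_vote` is a hypothetical name, and how PUMA breaks ties is not documented here:

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][0]


# Five sampled answers for one instance (illustrative).
print(majority_vote(["High", "Medium", "High", "Low", "High"]))
```

With temperature 0.0 every sample would be identical and the vote would add nothing, which is why the strategy requires temperature > 0.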

14. Advanced: Sustainability Tracking¶

Enable CodeCarbon to record CO₂ and energy consumption for each run:

sustainability:
  codecarbon: true
  country_iso: ESP        # used for regional carbon intensity lookup

The @track_emissions decorator wraps the inference loop. Emissions are:

Written to results/<run_id>/emissions_data.csv
Stored in the emissions table in data/puma.db
Shown in the report under the Sustainability section
Plotted in the Sustainability Frontier dashboard view

Quality-adjusted cost metric¶

After a run with emissions tracking, PUMA computes:

gCO₂_per_F1_point = total_gCO₂ / (f1_macro × 100)

This lets you compare models not just by accuracy but by their carbon cost per unit of quality gained.
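As a worked example of the formula, here is a small sketch with made-up numbers. The function name is hypothetical, and the guard against a zero F1 score is an added assumption, since the formula is undefined in that case.

```python
def gco2_per_f1_point(total_gco2, f1_macro):
    """Grams of CO2 per F1 percentage point: total_gCO2 / (f1_macro * 100)."""
    if f1_macro <= 0:
        raise ValueError("f1_macro must be positive")
    return total_gco2 / (f1_macro * 100)


# Illustrative values: 12.4 g CO2 for a run that reached f1_macro = 0.62.
cost = gco2_per_f1_point(12.4, 0.62)
print(f"{cost:.3f} gCO2 per F1 point")
```

Here 12.4 g over 62 F1 points gives 0.2 g per point. A model with a slightly lower F1 score but far lower emissions can come out ahead on this metric.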

Quick Reference Card¶

# Provision a clean machine
./start_puma.sh

# Hardware check
puma preflight

# List locally-installed models
puma models list

# See the curated catalog with availability
puma models recommended

# Pull a model (delegated to Ollama)
docker compose exec puma_ollama ollama pull qwen2.5:3b

# Verify datasets
puma datasets verify

# Dry-run (no Ollama needed)
puma run specs/runs/smoke_triage.yaml --dry-run

# Live benchmark
puma run specs/runs/smoke_triage.yaml

# Compare two runs
puma compare <run_id_1> <run_id_2>

# Generate a report
puma report <run_id>

# Generate a PDF report
puma report <run_id> --format pdf

# Open the dashboard
open http://localhost:8501   # (or puma dashboard)

# Database schema
puma db migrate
puma db status

# Inference cache
puma cache stats
puma cache clear