PUMA Architecture¶

1. Data Flow¶

RunSpec (YAML)
    │
    ▼
Runner.__init__()
    ├── RunSpec.from_yaml()          Pydantic v2 validation + cross-validators
    ├── init_db()                    SQLAlchemy: create tables if not exist
    └── run_id = f"{spec.id}__{spec_hash}__{timestamp}"
    │
    ▼
Runner.run()
    │
    ├─ [DB] INSERT runs (status="running")
    │
    ├─ Runner._execute_inferences()
    │       │
    │       ├── Scenario.sample(n, seed)        DataFrame from CSV dataset
    │       ├── OllamaClient(host, timeout)     HTTP client (httpx)
    │       ├── InferenceCache(db_path)         SQLite prompt-hash → response
    │       │
    │       └── for model × strategy × row × perturbation:
    │               ├── _apply_perturbation()   typos / case / truncate / noise
    │               ├── Strategy.build_prompt() Jinja2 template render
    │               ├── [cache hit?] → return cached response
    │               ├── OllamaClient.generate_sync()   POST /api/generate
    │               └── Scenario.parse_response()  regex → label or None
    │
    ├─ Runner._compute_metrics()
    │       ├── classification_metrics()    F1, accuracy, confusion matrix
    │       ├── regression_metrics()        MAE, MDAE, RMSE, MAE by bin
    │       ├── percentiles()               latency p50/p95/p99
    │       └── parse_failure_rate
    │
    ├─ Runner._persist_predictions()
    │       └── [DB] INSERT instances + predictions
    │
    ├─ [DB] UPDATE runs (status="done")
    ├─ [DB] INSERT metrics (flat key → value rows)
    ├─ [DB] INSERT profile_snapshots
    │
    ├─ results/<run_id>/runspec.yaml     frozen spec
    └─ results/<run_id>/metrics.json    all computed metrics

2. Package Map¶

Package	Module(s)	Key classes / functions
`puma.preflight`	`detect`, `profile`, `provisioning`, `report`	`detect_capabilities()`, `select_profile()`, `check_provisioning()`
`puma.runtime`	`client`, `cache`	`OllamaClient`, `InferenceCache`
`puma.datasets`	`jira`, `tawos`, `verify`	`load_jira()`, `load_tawos()`, `verify_jira()`, `print_verify_report()`
`puma.scenarios`	`triage_jira`, `estimation_tawos`, `prioritization_jira`, `base`	`Scenario` ABC, 3 concrete classes
`puma.adaptation`	`base`, `strategies`, `examples`	`Strategy` ABC, 11 strategy classes, `get_strategy()`, `select_examples()`
`puma.perturbations`	`text`	`typos()`, `case_change()`, `truncate()`, `tech_noise()`, `reorder_fields()`
`puma.metrics`	`accuracy`, `calibration`, `robustness`, `fairness`, `efficiency`, `stability`	`classification_metrics()`, `expected_calibration_error()`, `robustness_score()`, etc.
`puma.sustainability`	`codecarbon_wrapper`	`@track_emissions`, `emissions_summary()`, `gco2_per_f1_point()`
`puma.orchestrator`	`runspec`, `runner`, `compare`	`RunSpec`, `Runner`, `compare_runs()`
`puma.storage`	`models`, `db`	`Run`, `Instance`, `Prediction`, `Metric`, `Emission`, `ProfileSnapshot`, `init_db()`, `session_scope()`
`puma.dashboard`	`app`, `components`, `data`	Streamlit 9-view app, `load_runs()`, `metrics_pivot()`, `metric_card()`, etc.
`puma.reporting`	`report`	`generate_report()`, `_convert_to_pdf()`
`puma.cli`	`cli`	Typer app: 15 commands + 5 command groups

3. Docker Services¶

docker-compose.yml
│
├── puma_ollama          image: ollama/ollama:latest
│   ├── port: 11434:11434
│   ├── volume: ollama_models:/root/.ollama
│   └── network: puma_network
│
├── puma_runner          build: Dockerfile (python:3.11-slim + pip install)
│   ├── volume: .:/app  (live code mount)
│   ├── volume: puma_data:/app/data
│   ├── env: PYTHONPATH=/app/src, OLLAMA_HOST=http://puma_ollama:11434
│   └── network: puma_network
│
└── puma_dashboard       build: same Dockerfile
    ├── port: 8501:8501
    ├── volume: .:/app, puma_data:/app/data
    ├── command: streamlit run src/puma/dashboard/app.py ...
    └── network: puma_network

Shared volumes: - ollama_models — Ollama model weights (persistent across container restarts) - puma_data — SQLite databases and datasets (mounted at /app/data)

4. Database Schema¶

All tables use SQLAlchemy 2.0 declarative models (src/puma/storage/models.py).

-- One row per benchmark run
CREATE TABLE runs (
    run_id      TEXT PRIMARY KEY,   -- "{spec.id}__{hash}__{timestamp}"
    spec_hash   TEXT,               -- SHA-256[:16] of RunSpec (excluding description)
    spec_yaml   TEXT,               -- JSON-serialised RunSpec for replay
    profile     TEXT,               -- hardware profile used
    started_at  DATETIME,
    finished_at DATETIME,
    status      TEXT                -- running | done | error
);

-- Canonical dataset instances (deduplicated across runs)
CREATE TABLE instances (
    instance_id TEXT PRIMARY KEY,
    dataset     TEXT,               -- triage_jira | estimation_tawos | ...
    source_id   TEXT,               -- original ID from the dataset
    input_text  TEXT,               -- raw input (title + description)
    gold_label  TEXT,
    UNIQUE (dataset, source_id)
);

-- One row per model × strategy × instance × perturbation
CREATE TABLE predictions (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id        TEXT REFERENCES runs(run_id),
    instance_id   TEXT REFERENCES instances(instance_id),
    model         TEXT,
    strategy      TEXT,
    prompt_hash   TEXT,             -- SHA-256[:16] of the rendered prompt
    raw_response  TEXT,
    parsed_label  TEXT,             -- null if parse_response returned None
    latency_ms    REAL,
    tokens_in     INTEGER,
    tokens_out    INTEGER,
    perturbation  TEXT,             -- null = original; else perturbation name
    seed          INTEGER,
    recorded_at   DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Flat metric rows (one per metric name per run)
CREATE TABLE metrics (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id      TEXT REFERENCES runs(run_id),
    scope       TEXT DEFAULT 'global',
    metric_name TEXT,               -- e.g. "f1_macro", "latency.p95"
    value       REAL,
    computed_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- CodeCarbon emissions per run
CREATE TABLE emissions (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id     TEXT REFERENCES runs(run_id),
    kwh        REAL,
    co2_kg     REAL,
    duration_s REAL,
    recorded_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Hardware snapshot at run time
CREATE TABLE profile_snapshots (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id         TEXT REFERENCES runs(run_id),
    os             TEXT,
    cpu            TEXT,
    ram_gb         REAL,
    gpu            TEXT,
    vram_gb        REAL,
    ollama_version TEXT,
    puma_version   TEXT,
    snapshot_at    DATETIME DEFAULT CURRENT_TIMESTAMP
);

5. OllamaClient Contract¶

@dataclass(frozen=True)
class TokenLogprob:
    token: str
    logprob: float
    top_logprobs: list["TokenLogprob"]

@dataclass(frozen=True)
class GenerationResult:
    model: str
    response: str
    logprobs: list[TokenLogprob]
    total_duration_ns: int
    load_duration_ns: int
    prompt_eval_count: int      # tokens in prompt
    eval_count: int             # tokens generated
    eval_duration_ns: int
    raw: dict                   # full Ollama JSON response

class OllamaClient:
    def generate_sync(self, model, prompt, *, temperature, seed,
                      max_tokens, logprobs, top_logprobs) -> GenerationResult: ...
    async def generate(self, ...) -> GenerationResult: ...

The client always sends options: {"temperature": ..., "seed": ..., "num_predict": ...} in the /api/generate payload. When logprobs=True, it adds "logprobs": true, "top_logprobs": N.

Retry policy: 3 attempts with exponential backoff on connection errors.

6. Inference Cache¶

InferenceCache stores (model, prompt_hash, temperature, seed) → (response, tokens_in, tokens_out) in data/cache/inferences.db. On a cache hit the Ollama call is skipped entirely.

cache = InferenceCache(db_path=Path("data/cache/inferences.db"))
hit = cache.get(model, prompt_hash, temperature, seed)
if hit is None:
    result = client.generate_sync(...)
    cache.put(model, prompt_hash, temperature, seed, result)

7. Prompt Template System¶

Templates use Jinja2. Each scenario × strategy pair has its own .jinja file:

specs/prompts/
├── triage_jira/
│   ├── zero_shot.jinja
│   ├── zero_shot_cot.jinja
│   ├── few_shot.jinja
│   ├── cot_few_shot.jinja
│   ├── rcoif.jinja
│   ├── contextual_anchoring.jinja
│   └── egi.jinja
├── estimation_tawos/      (same 7 files)
└── prioritization_jira/   (same 7 files)

Available template variables:

Variable	Type	Description
`{{ title }}`	str	Issue title
`{{ description }}`	str	Issue description / body
`{{ examples }}`	list[dict]	Few-shot examples (empty for zero-shot)
`{{ gold_label }}`	str	Expected label (used in CoT rationale examples)
`{{ labels }}`	list[str]	Valid output labels for the scenario

Strategy.build_prompt(scenario, instance) renders the template and returns the final prompt string.

8. Key Design Decisions¶

Decision	Rationale
Spec-driven runs	RunSpec YAML + fixed seed makes every run 100% reproducible
Dry-run mode	Full pipeline test without Ollama; used in all 206 unit tests
PYTHONPATH=/app/src	No editable install needed; volume-mount code works immediately
Read-only dashboard	Streamlit never writes to DB; preserves result integrity
SQLite over Postgres	Zero-infrastructure; single file; embeds in Docker volume
Flat metrics table	`(run_id, metric_name, value)` enables pivot and comparison without schema changes when new metrics are added
session_scope() context manager	Guarantees rollback on exception; prevents partial writes
Sync OllamaClient in Runner	Avoids event-loop complexity in the orchestration loop; async variant available for future parallel batching
Parse failure = None	Failed parses are excluded from metric computation but counted in `parse_failure_rate`; no "unknown" class pollution