
PUMA Architecture

1. Data Flow

RunSpec (YAML)
Runner.__init__()
    ├── RunSpec.from_yaml()          Pydantic v2 validation + cross-validators
    ├── init_db()                    SQLAlchemy: create tables if not exist
    └── run_id = f"{spec.id}__{spec_hash}__{timestamp}"
Runner.run()
    ├─ [DB] INSERT runs (status="running")
    ├─ Runner._execute_inferences()
    │       │
    │       ├── Scenario.sample(n, seed)        DataFrame from CSV dataset
    │       ├── OllamaClient(host, timeout)     HTTP client (httpx)
    │       ├── InferenceCache(db_path)         SQLite prompt-hash → response
    │       │
    │       └── for model × strategy × row × perturbation:
    │               ├── _apply_perturbation()   typos / case / truncate / noise
    │               ├── Strategy.build_prompt() Jinja2 template render
    │               ├── [cache hit?] → return cached response
    │               ├── OllamaClient.generate_sync()   POST /api/generate
    │               └── Scenario.parse_response()  regex → label or None
    ├─ Runner._compute_metrics()
    │       ├── classification_metrics()    F1, accuracy, confusion matrix
    │       ├── regression_metrics()        MAE, MDAE, RMSE, MAE by bin
    │       ├── percentiles()               latency p50/p95/p99
    │       └── parse_failure_rate
    ├─ Runner._persist_predictions()
    │       └── [DB] INSERT instances + predictions
    ├─ [DB] UPDATE runs (status="done")
    ├─ [DB] INSERT metrics (flat key → value rows)
    ├─ [DB] INSERT profile_snapshots
    ├─ results/<run_id>/runspec.yaml     frozen spec
    └─ results/<run_id>/metrics.json    all computed metrics
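The run identifier in the flow above combines the spec ID, a short spec hash, and a timestamp. The following is a minimal sketch of how such an ID could be derived, not the project's actual code: the helper names `spec_hash` and `make_run_id`, the canonical-JSON serialisation, and the timestamp format are all assumptions. Only the `{spec.id}__{spec_hash}__{timestamp}` layout, the 16-character SHA-256 prefix, and the exclusion of `description` come from this document.

```python
import hashlib
import json
from datetime import datetime, timezone


def spec_hash(spec: dict) -> str:
    """First 16 hex characters of a SHA-256 over the spec, ignoring `description`."""
    hashed_fields = {k: v for k, v in spec.items() if k != "description"}
    # sort_keys makes the serialisation independent of key order (assumed scheme)
    canonical = json.dumps(hashed_fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def make_run_id(spec: dict, now: datetime) -> str:
    """Compose "{spec.id}__{spec_hash}__{timestamp}"."""
    timestamp = now.strftime("%Y%m%dT%H%M%SZ")  # timestamp format is an assumption
    return f"{spec['id']}__{spec_hash(spec)}__{timestamp}"


# Hypothetical spec: rewording the description must not change the hash
spec = {"id": "triage_demo", "description": "first attempt", "seed": 42}
reworded = {**spec, "description": "same run, new wording"}
run_id = make_run_id(spec, datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
```

Hashing a canonical serialisation means two specs that differ only in their description map to the same `spec_hash`, so reruns of the same configuration can be grouped while the timestamp keeps each `run_id` unique.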

2. Package Map

| Package | Module(s) | Key classes / functions |
| --- | --- | --- |
| puma.preflight | detect, profile, provisioning, report | detect_capabilities(), select_profile(), check_provisioning() |
| puma.runtime | client, cache | OllamaClient, InferenceCache |
| puma.datasets | jira, tawos, verify | load_jira(), load_tawos(), verify_jira(), print_verify_report() |
| puma.scenarios | triage_jira, estimation_tawos, prioritization_jira, base | Scenario ABC, 3 concrete classes |
| puma.adaptation | base, strategies, examples | Strategy ABC, 11 strategy classes, get_strategy(), select_examples() |
| puma.perturbations | text | typos(), case_change(), truncate(), tech_noise(), reorder_fields() |
| puma.metrics | accuracy, calibration, robustness, fairness, efficiency, stability | classification_metrics(), expected_calibration_error(), robustness_score(), etc. |
| puma.sustainability | codecarbon_wrapper | @track_emissions, emissions_summary(), gco2_per_f1_point() |
| puma.orchestrator | runspec, runner, compare | RunSpec, Runner, compare_runs() |
| puma.storage | models, db | Run, Instance, Prediction, Metric, Emission, ProfileSnapshot, init_db(), session_scope() |
| puma.dashboard | app, components, data | Streamlit 9-view app, load_runs(), metrics_pivot(), metric_card(), etc. |
| puma.reporting | report | generate_report(), _convert_to_pdf() |
| puma.cli | cli | Typer app: 15 commands + 5 command groups |

3. Docker Services

docker-compose.yml
├── puma_ollama          image: ollama/ollama:latest
│   ├── port: 11434:11434
│   ├── volume: ollama_models:/root/.ollama
│   └── network: puma_network
├── puma_runner          build: Dockerfile (python:3.11-slim + pip install)
│   ├── volume: .:/app  (live code mount)
│   ├── volume: puma_data:/app/data
│   ├── env: PYTHONPATH=/app/src, OLLAMA_HOST=http://puma_ollama:11434
│   └── network: puma_network
└── puma_dashboard       build: same Dockerfile
    ├── port: 8501:8501
    ├── volume: .:/app, puma_data:/app/data
    ├── command: streamlit run src/puma/dashboard/app.py ...
    └── network: puma_network

Shared volumes:

- ollama_models: Ollama model weights (persistent across container restarts)
- puma_data: SQLite databases and datasets (mounted at /app/data)

4. Database Schema

All tables use SQLAlchemy 2.0 declarative models (src/puma/storage/models.py).

-- One row per benchmark run
CREATE TABLE runs (
    run_id      TEXT PRIMARY KEY,   -- "{spec.id}__{hash}__{timestamp}"
    spec_hash   TEXT,               -- SHA-256[:16] of RunSpec (excluding description)
    spec_yaml   TEXT,               -- JSON-serialised RunSpec for replay
    profile     TEXT,               -- hardware profile used
    started_at  DATETIME,
    finished_at DATETIME,
    status      TEXT                -- running | done | error
);

-- Canonical dataset instances (deduplicated across runs)
CREATE TABLE instances (
    instance_id TEXT PRIMARY KEY,
    dataset     TEXT,               -- triage_jira | estimation_tawos | ...
    source_id   TEXT,               -- original ID from the dataset
    input_text  TEXT,               -- raw input (title + description)
    gold_label  TEXT,
    UNIQUE (dataset, source_id)
);

-- One row per model × strategy × instance × perturbation
CREATE TABLE predictions (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id        TEXT REFERENCES runs(run_id),
    instance_id   TEXT REFERENCES instances(instance_id),
    model         TEXT,
    strategy      TEXT,
    prompt_hash   TEXT,             -- SHA-256[:16] of the rendered prompt
    raw_response  TEXT,
    parsed_label  TEXT,             -- null if parse_response returned None
    latency_ms    REAL,
    tokens_in     INTEGER,
    tokens_out    INTEGER,
    perturbation  TEXT,             -- null = original; else perturbation name
    seed          INTEGER,
    recorded_at   DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Flat metric rows (one per metric name per scope per run)
CREATE TABLE metrics (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id      TEXT REFERENCES runs(run_id),
    scope       TEXT DEFAULT 'global',
    metric_name TEXT,               -- e.g. "f1_macro", "latency.p95"
    value       REAL,
    computed_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- CodeCarbon emissions per run
CREATE TABLE emissions (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id     TEXT REFERENCES runs(run_id),
    kwh        REAL,
    co2_kg     REAL,
    duration_s REAL,
    recorded_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Hardware snapshot at run time
CREATE TABLE profile_snapshots (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id         TEXT REFERENCES runs(run_id),
    os             TEXT,
    cpu            TEXT,
    ram_gb         REAL,
    gpu            TEXT,
    vram_gb        REAL,
    ollama_version TEXT,
    puma_version   TEXT,
    snapshot_at    DATETIME DEFAULT CURRENT_TIMESTAMP
);
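Because the metrics table stores one (run_id, metric_name, value) row per metric, comparing runs comes down to a pivot query. The sketch below uses the standard-library `sqlite3` module against an in-memory copy of the `metrics` table; the `pivot` helper and the sample rows are made up for illustration and are not taken from the project or from real benchmark results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " run_id TEXT, scope TEXT DEFAULT 'global',"
    " metric_name TEXT, value REAL)"
)
# Illustrative rows only; the values are not real measurements
rows = [
    ("run_a", "f1_macro", 0.71),
    ("run_a", "latency.p95", 840.0),
    ("run_b", "f1_macro", 0.75),
    ("run_b", "latency.p95", 910.0),
]
conn.executemany(
    "INSERT INTO metrics (run_id, metric_name, value) VALUES (?, ?, ?)", rows
)


def pivot(conn: sqlite3.Connection, metric_names: list[str]) -> list[tuple]:
    """Return one row per run with one column per requested metric."""
    # One MAX(CASE ...) column per metric; the names are bound as parameters
    columns = ", ".join(
        f"MAX(CASE WHEN metric_name = ? THEN value END) AS m{i}"
        for i in range(len(metric_names))
    )
    sql = (
        f"SELECT run_id, {columns} FROM metrics"
        " WHERE scope = 'global' GROUP BY run_id ORDER BY run_id"
    )
    return conn.execute(sql, list(metric_names)).fetchall()


table = pivot(conn, ["f1_macro", "latency.p95"])
```

Adding a new metric only adds rows, so the same query works without a schema migration; a metric that is missing for a run comes back as NULL (`None` in Python).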

5. OllamaClient Contract

from dataclasses import dataclass

@dataclass(frozen=True)
class TokenLogprob:
    token: str
    logprob: float
    top_logprobs: list["TokenLogprob"]

@dataclass(frozen=True)
class GenerationResult:
    model: str
    response: str
    logprobs: list[TokenLogprob]
    total_duration_ns: int
    load_duration_ns: int
    prompt_eval_count: int      # tokens in prompt
    eval_count: int             # tokens generated
    eval_duration_ns: int
    raw: dict                   # full Ollama JSON response

class OllamaClient:
    def generate_sync(self, model, prompt, *, temperature, seed,
                      max_tokens, logprobs, top_logprobs) -> GenerationResult: ...
    async def generate(self, ...) -> GenerationResult: ...

The client always sends options: {"temperature": ..., "seed": ..., "num_predict": ...} in the /api/generate payload. When logprobs=True, it adds "logprobs": true, "top_logprobs": N.
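As a rough illustration of that payload, the sketch below builds the request body as a plain dictionary. `build_payload` is a hypothetical helper, not a function from `puma.runtime`. The mapping of `max_tokens` to `num_predict`, the top-level placement of the logprob fields, and the `"stream": False` entry are assumptions; the model name is a placeholder.

```python
from typing import Any


def build_payload(
    model: str,
    prompt: str,
    *,
    temperature: float,
    seed: int,
    max_tokens: int,
    logprobs: bool = False,
    top_logprobs: int = 0,
) -> dict[str, Any]:
    """Assemble a request body for POST /api/generate."""
    payload: dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # assumption: a single JSON response rather than a stream
        # Sampling options are always sent so that runs stay comparable
        "options": {
            "temperature": temperature,
            "seed": seed,
            "num_predict": max_tokens,
        },
    }
    if logprobs:
        # Only added on request; top-level placement is an assumption
        payload["logprobs"] = True
        payload["top_logprobs"] = top_logprobs
    return payload


plain = build_payload(
    "llama3", "Classify this issue.", temperature=0.0, seed=42, max_tokens=16
)
scored = build_payload(
    "llama3", "Classify this issue.", temperature=0.0, seed=42, max_tokens=16,
    logprobs=True, top_logprobs=5,
)
```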

Retry policy: 3 attempts with exponential backoff on connection errors.
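The retry policy can be sketched as a small wrapper. This is an illustration under stated assumptions rather than the client's real implementation: `with_retries`, the 0.5 s base delay, and the doubling factor are hypothetical, and the built-in `ConnectionError` stands in for the connection errors the client retries on (httpx raises its own exception classes, such as `httpx.ConnectError`).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base, 2 * base, 4 * base, ...
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demo: a call that fails twice, then succeeds on the third attempt
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("ollama not reachable")
    return "ok"


delays: list[float] = []
outcome = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
```

Injecting `sleep` keeps the wrapper testable without real waiting; other exception types are deliberately not caught, so a malformed request fails immediately instead of being retried.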

6. Inference Cache

InferenceCache stores (model, prompt_hash, temperature, seed) → (response, tokens_in, tokens_out) in data/cache/inferences.db. On a cache hit the Ollama call is skipped entirely.

from pathlib import Path

cache = InferenceCache(db_path=Path("data/cache/inferences.db"))
hit = cache.get(model, prompt_hash, temperature, seed)
if hit is None:
    # Cache miss: call Ollama, then store the result for later runs
    result = client.generate_sync(...)
    cache.put(model, prompt_hash, temperature, seed, result)
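To make the key and value structure concrete, here is a minimal SQLite-backed sketch of such a cache. `SketchCache` and its table layout are hypothetical and are not the actual `InferenceCache`. The 16-character prompt hash follows the schema comment in section 4, while the use of `prompt_eval_count` and `eval_count` as `tokens_in` and `tokens_out` is an assumption based on the `GenerationResult` fields.

```python
import hashlib
import sqlite3
from types import SimpleNamespace
from typing import Optional


def prompt_hash(prompt: str) -> str:
    """First 16 hex characters of the SHA-256 of the rendered prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


class SketchCache:
    """Maps (model, prompt_hash, temperature, seed) to a stored response."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS inferences ("
            " model TEXT, prompt_hash TEXT, temperature REAL, seed INTEGER,"
            " response TEXT, tokens_in INTEGER, tokens_out INTEGER,"
            " PRIMARY KEY (model, prompt_hash, temperature, seed))"
        )

    def get(self, model: str, p_hash: str, temperature: float,
            seed: int) -> Optional[tuple[str, int, int]]:
        # Returns (response, tokens_in, tokens_out), or None on a miss
        return self._conn.execute(
            "SELECT response, tokens_in, tokens_out FROM inferences"
            " WHERE model = ? AND prompt_hash = ? AND temperature = ? AND seed = ?",
            (model, p_hash, temperature, seed),
        ).fetchone()

    def put(self, model: str, p_hash: str, temperature: float,
            seed: int, result) -> None:
        # `result` is any object with GenerationResult-like attributes
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO inferences VALUES (?, ?, ?, ?, ?, ?, ?)",
                (model, p_hash, temperature, seed,
                 result.response, result.prompt_eval_count, result.eval_count),
            )


cache = SketchCache()
key = ("llama3", prompt_hash("Classify: app crashes on login"), 0.0, 42)
miss = cache.get(*key)
cache.put(*key, SimpleNamespace(response="Bug", prompt_eval_count=12, eval_count=1))
hit = cache.get(*key)
```

Including temperature and seed in the key means a run with different sampling settings never reuses a response generated under other settings.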

7. Prompt Template System

Templates use Jinja2. Each scenario × strategy pair has its own .jinja file:

specs/prompts/
├── triage_jira/
│   ├── zero_shot.jinja
│   ├── zero_shot_cot.jinja
│   ├── few_shot.jinja
│   ├── cot_few_shot.jinja
│   ├── rcoif.jinja
│   ├── contextual_anchoring.jinja
│   └── egi.jinja
├── estimation_tawos/      (same 7 files)
└── prioritization_jira/   (same 7 files)

Available template variables:

| Variable | Type | Description |
| --- | --- | --- |
| {{ title }} | str | Issue title |
| {{ description }} | str | Issue description / body |
| {{ examples }} | list[dict] | Few-shot examples (empty for zero-shot) |
| {{ gold_label }} | str | Expected label (used in CoT rationale examples) |
| {{ labels }} | list[str] | Valid output labels for the scenario |

Strategy.build_prompt(scenario, instance) renders the template and returns the final prompt string.
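Since every scenario × strategy pair maps to one file under specs/prompts/, a small check can confirm that no template is missing before a run starts. The sketch below is a hypothetical helper, not part of PUMA: `template_path` and `missing_templates` are assumed names, and the scenario and strategy lists are copied from the directory tree above. The demo builds a throwaway directory instead of touching the real templates.

```python
import tempfile
from pathlib import Path

SCENARIOS = ["triage_jira", "estimation_tawos", "prioritization_jira"]
STRATEGIES = [
    "zero_shot", "zero_shot_cot", "few_shot", "cot_few_shot",
    "rcoif", "contextual_anchoring", "egi",
]


def template_path(root: Path, scenario: str, strategy: str) -> Path:
    """Location of the template for one scenario × strategy pair."""
    return root / scenario / f"{strategy}.jinja"


def missing_templates(root: Path) -> list[Path]:
    """Every expected template file that does not exist under `root`."""
    expected = (
        template_path(root, scenario, strategy)
        for scenario in SCENARIOS
        for strategy in STRATEGIES
    )
    return [path for path in expected if not path.is_file()]


# Demo: create all 21 templates in a temporary directory, then delete one
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for scenario in SCENARIOS:
        (root / scenario).mkdir()
        for strategy in STRATEGIES:
            template_path(root, scenario, strategy).touch()
    template_path(root, "triage_jira", "egi").unlink()
    missing = [p.relative_to(root).as_posix() for p in missing_templates(root)]
```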

8. Key Design Decisions

| Decision | Rationale |
| --- | --- |
| Spec-driven runs | RunSpec YAML + fixed seed makes every run 100% reproducible |
| Dry-run mode | Full pipeline test without Ollama; used in all 206 unit tests |
| PYTHONPATH=/app/src | No editable install needed; volume-mounted code works immediately |
| Read-only dashboard | Streamlit never writes to the DB; preserves result integrity |
| SQLite over Postgres | Zero infrastructure; single file; fits in a Docker volume |
| Flat metrics table | (run_id, metric_name, value) enables pivot and comparison without schema changes when new metrics are added |
| session_scope() context manager | Guarantees rollback on exception; prevents partial writes |
| Sync OllamaClient in Runner | Avoids event-loop complexity in the orchestration loop; async variant available for future parallel batching |
| Parse failure = None | Failed parses are excluded from metric computation but counted in parse_failure_rate; no "unknown" class pollution |
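The last decision can be illustrated with a short sketch: a regex-based parser that returns None when no valid label is found, and a scorer that drops those rows from accuracy while still reporting them as a failure rate. The function names, the label set, and the sample responses are all hypothetical; the actual Scenario.parse_response() patterns are not shown in this document.

```python
import re
from typing import Optional

LABELS = ["bug", "feature", "task"]  # hypothetical label set


def parse_response(raw: str, labels: list[str]) -> Optional[str]:
    """Return the first valid label found in the response, or None."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    match = re.search(pattern, raw, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def accuracy_and_failure_rate(
    parsed: list[Optional[str]], gold: list[str]
) -> tuple[float, float]:
    """Accuracy over parsed rows only, plus the share of rows that failed to parse."""
    scored = [(p, g) for p, g in zip(parsed, gold) if p is not None]
    failure_rate = (len(parsed) - len(scored)) / len(parsed) if parsed else 0.0
    accuracy = sum(p == g for p, g in scored) / len(scored) if scored else 0.0
    return accuracy, failure_rate


# Made-up responses: the second one contains no valid label
raw_responses = ["Label: Bug", "I am not sure.", "feature", "This is a task."]
gold = ["bug", "bug", "task", "task"]
parsed = [parse_response(r, LABELS) for r in raw_responses]
accuracy, failure_rate = accuracy_and_failure_rate(parsed, gold)
```

Here one of four responses fails to parse, so the failure rate is 0.25 and accuracy is computed over the remaining three rows. Mapping the failure to an extra "unknown" class instead would add a row and column to the confusion matrix and distort macro-averaged scores.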