Adding a Model to the Catalog¶

This guide explains how to register a new Ollama-compatible model in PUMA so it can be selected in run-specs, surfaced by puma models recommended, and (once pulled into Ollama) listed by puma models list.

Overview¶

Adding a model requires three steps:

Pull the model via Ollama so it is available for inference
Register it in config/models_catalog.yaml
Validate with a dry-run benchmark

Step 1 — Pull the model¶

Pull inside the puma_ollama container (preferred — the model lands in the shared ollama_models volume so every PUMA service sees it):

docker compose exec puma_ollama ollama pull llama3.2:3b

Or directly from the host if Ollama is installed:

ollama pull llama3.2:3b

Note. puma itself never pulls models; the puma models sub-group is read-only (list / show / recommended). Pulling is delegated to the Ollama CLI for a single source of truth.

Check it is available:

docker compose exec puma_ollama ollama list

Step 2 — Register in `config/models_catalog.yaml`¶

Open config/models_catalog.yaml and add an entry under the models: key:

models:
  # ... existing entries ...

  - ollama_tag: llama3.2:3b
    params_b: 3
    gguf_size_gb: 2.0
    profiles_compatible:
      - cpu-standard
      - gpu-entry
      - gpu-mid
    context_window: 128000
    languages: [en]
    notes: "Meta Llama 3.2 3B instruct — fast, small context window"

Field reference¶

Field	Type	Required	Description
`ollama_tag`	string	yes	Exact tag for `ollama pull` (must match exactly)
`params_b`	float	yes	Parameter count in billions (e.g. `3` for 3B)
`gguf_size_gb`	float	yes	Approximate disk size of the GGUF file
`profiles_compatible`	list[str]	yes	Hardware profiles that can run this model without OOM
`context_window`	int	yes	Maximum context length in tokens
`languages`	list[str]	no	ISO 639-1 language codes the model supports well
`notes`	string	no	Free-text description surfaced as the `rationale` column in `puma models recommended`

Profile compatibility guidelines¶

Profile	Max model size
`cpu-lite`	≤ 1.5B params / ≤ 1.0 GB
`cpu-standard`	≤ 7B params / ≤ 5.0 GB
`gpu-entry`	≤ 7B params / ≤ 5.0 GB VRAM
`gpu-mid`	≤ 14B params / ≤ 10.0 GB VRAM
`gpu-high`	≤ 32B+ params / ≤ 24.0 GB VRAM

Step 3 — Create a smoke run-spec¶

Create a file in specs/runs/ to test the new model:

id: smoke_llama3_triage
description: "Smoke run: llama3.2:3b × triage_jira × zero-shot"
scenario: triage_jira
sample_size: 10
models:
  - llama3.2:3b
adaptation:
  strategy: [zero-shot]
inference:
  temperature: 0.0
  seed: 42
metrics: [f1_macro]

Step 4 — Validate¶

Run in dry-run mode first (no Ollama call, validates the full pipeline):

puma run specs/runs/smoke_llama3_triage.yaml --dry-run

Expected output:

Run complete: smoke_llama3_triage__<hash>__<ts>
Predictions: 10
  parse_failure_rate: 1.0000    ← expected in dry-run (response is "[dry-run]")

Then run live:

puma run specs/runs/smoke_llama3_triage.yaml

Step 5 — (Optional) Add custom prompt templates¶

PUMA's default templates work with most instruction-tuned models. If the new model requires a specific chat format (e.g. [INST]...[/INST] for Mistral, <|user|>...<|assistant|> for Phi), create a model-specific template file:

specs/prompts/triage_jira/zero_shot.jinja       ← default (used by all models)

Available Jinja2 template variables:

Variable	Description
`{{ title }}`	Issue title
`{{ description }}`	Issue description / body
`{{ examples }}`	List of few-shot example dicts (empty for zero-shot)
`{{ labels }}`	List of valid output labels

Model-specific template dispatch is not yet implemented (all models share templates). To specialise, create a separate strategy or fork the template file.

Step 6 — Compare with existing models¶

After running the smoke benchmark, compare results against a baseline:

puma compare smoke_qwen25_3b__<hash>__<ts> smoke_llama3_triage__<hash>__<ts>

Or open the dashboard to see the heatmap:

open http://localhost:8501