Adding a Model to the Catalog
This guide explains how to register a new Ollama-compatible model in PUMA so it can be selected in run-specs, surfaced by puma models recommended, and (once pulled into Ollama) listed by puma models list.
Overview
Adding a model involves three core tasks, which the numbered steps below cover in detail:
- Pull the model via Ollama so it is available for inference
- Register it in config/models_catalog.yaml
- Validate with a dry-run benchmark
Step 1: Pull the model
Pull inside the puma_ollama container (preferred — the model lands in the shared ollama_models volume so every PUMA service sees it):
Or directly from the host if Ollama is installed:
Note.
puma itself never pulls models; the puma models sub-group is read-only (list/show/recommended). Pulling is delegated to the Ollama CLI for a single source of truth.
Check it is available:
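As a programmatic alternative, the check can be sketched against Ollama's HTTP API, whose GET /api/tags endpoint lists locally pulled models. This is a minimal sketch, not part of PUMA: the function names are hypothetical, and the base URL assumes Ollama is reachable on its default port from wherever the script runs.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default port; adjust for your deployment.
OLLAMA_URL = "http://localhost:11434"


def installed_tags(tags_payload: dict) -> set[str]:
    """Extract model tags from the JSON body returned by GET /api/tags."""
    return {model["name"] for model in tags_payload.get("models", [])}


def is_available(tag: str, tags_payload: dict) -> bool:
    """Return True if the exact tag appears among the pulled models."""
    return tag in installed_tags(tags_payload)


def fetch_tags_payload(base_url: str = OLLAMA_URL) -> dict:
    """Ask the Ollama server which models it currently holds."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling is_available("llama3.2:3b", fetch_tags_payload()) then answers whether the pull succeeded. The match is exact, which mirrors the requirement that the catalog tag match the Ollama tag exactly.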
Step 2: Register in config/models_catalog.yaml
Open config/models_catalog.yaml and add an entry under the models: key:
models:
  # ... existing entries ...
  - ollama_tag: llama3.2:3b
    params_b: 3
    gguf_size_gb: 2.0
    profiles_compatible:
      - cpu-standard
      - gpu-entry
      - gpu-mid
    context_window: 128000
    languages: [en]
    notes: "Meta Llama 3.2 3B instruct: fast and small, with a 128K context window"
Field reference
| Field | Type | Required | Description |
|---|---|---|---|
| ollama_tag | string | yes | Exact tag for ollama pull (must match exactly) |
| params_b | float | yes | Parameter count in billions (e.g. 3 for 3B) |
| gguf_size_gb | float | yes | Approximate disk size of the GGUF file, in GB |
| profiles_compatible | list[str] | yes | Hardware profiles that can run this model without OOM |
| context_window | int | yes | Maximum context length in tokens |
| languages | list[str] | no | ISO 639-1 language codes the model supports well |
| notes | string | no | Free-text description surfaced as the rationale column in puma models recommended |
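The field reference can be turned into a quick pre-commit check. The following is a minimal sketch under the assumption that the catalog has already been parsed into plain Python dicts; validate_entry and its error messages are hypothetical and are not PUMA's own validation.

```python
NUMBER = (int, float)

# Required and optional fields, taken from the field reference table.
REQUIRED_FIELDS = {
    "ollama_tag": str,
    "params_b": NUMBER,
    "gguf_size_gb": NUMBER,
    "profiles_compatible": list,
    "context_window": int,
}
OPTIONAL_FIELDS = {"languages": list, "notes": str}
KNOWN_PROFILES = ("cpu-lite", "cpu-standard", "gpu-entry", "gpu-mid", "gpu-high")


def _has_type(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return not isinstance(value, bool) and isinstance(value, expected)


def validate_entry(entry: dict) -> list[str]:
    """Return the problems found in one catalog entry (empty list if none)."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            errors.append(f"missing required field: {field}")
        elif not _has_type(entry[field], expected):
            errors.append(f"{field}: unexpected type {type(entry[field]).__name__}")
    for field, expected in OPTIONAL_FIELDS.items():
        if field in entry and not _has_type(entry[field], expected):
            errors.append(f"{field}: unexpected type {type(entry[field]).__name__}")
    profiles = entry.get("profiles_compatible")
    if isinstance(profiles, list):
        errors.extend(
            f"unknown profile: {p}" for p in profiles if p not in KNOWN_PROFILES
        )
    return errors
```

An empty result means the entry has every required field with a plausible type and only names profiles that appear in the compatibility table.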
Profile compatibility guidelines
| Profile | Max model size |
|---|---|
| cpu-lite | ≤ 1.5B params / ≤ 1.0 GB |
| cpu-standard | ≤ 7B params / ≤ 5.0 GB |
| gpu-entry | ≤ 7B params / ≤ 5.0 GB VRAM |
| gpu-mid | ≤ 14B params / ≤ 10.0 GB VRAM |
| gpu-high | ≤ 32B+ params / ≤ 24.0 GB VRAM |
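To see which profiles a model could fit under these guidelines, the table can be encoded as a lookup. This is a sketch with two stated assumptions: both limits in a row must hold at once, and the "32B+" in the gpu-high row is read as having no hard parameter cap. The name fitting_profiles is hypothetical.

```python
import math

# (max parameters in billions, max size in GB), copied from the guidelines table.
# Assumption: "32B+" for gpu-high means no hard parameter cap, only the size limit.
PROFILE_LIMITS = {
    "cpu-lite": (1.5, 1.0),
    "cpu-standard": (7.0, 5.0),
    "gpu-entry": (7.0, 5.0),
    "gpu-mid": (14.0, 10.0),
    "gpu-high": (math.inf, 24.0),
}


def fitting_profiles(params_b: float, gguf_size_gb: float) -> list[str]:
    """List every profile whose parameter and size limits both accommodate the model."""
    return [
        name
        for name, (max_params_b, max_size_gb) in PROFILE_LIMITS.items()
        if params_b <= max_params_b and gguf_size_gb <= max_size_gb
    ]


print(fitting_profiles(3, 2.0))
# ['cpu-standard', 'gpu-entry', 'gpu-mid', 'gpu-high']
```

The result is an upper bound on what to list in profiles_compatible. A catalog entry may deliberately name fewer profiles, as the llama3.2:3b example does.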
Step 3: Create a smoke run-spec
Create a file in specs/runs/ to test the new model:
id: smoke_llama3_triage
description: "Smoke run: llama3.2:3b × triage_jira × zero-shot"
scenario: triage_jira
sample_size: 10
models:
  - llama3.2:3b
adaptation:
  strategy: [zero-shot]
inference:
  temperature: 0.0
  seed: 42
metrics: [f1_macro]
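Because the run-spec refers to the model by the same tag used in the catalog, a typo in either file is easy to miss. The sketch below cross-checks the two, assuming both files have already been parsed into dicts shaped like the YAML above. Whether PUMA performs this check itself is not stated here, and unregistered_models is a hypothetical helper.

```python
def unregistered_models(run_spec: dict, catalog: dict) -> list[str]:
    """Return run-spec model tags that have no matching ollama_tag in the catalog."""
    known_tags = {entry["ollama_tag"] for entry in catalog.get("models", [])}
    return [tag for tag in run_spec.get("models", []) if tag not in known_tags]


# Dicts mirroring the catalog entry and the smoke run-spec shown above.
catalog = {"models": [{"ollama_tag": "llama3.2:3b"}]}
run_spec = {"id": "smoke_llama3_triage", "models": ["llama3.2:3b"]}

print(unregistered_models(run_spec, catalog))
# []
```

An empty list means every model named in the run-spec has a catalog entry.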
Step 4: Validate
Run in dry-run mode first (no Ollama call, validates the full pipeline):
Expected output:
Run complete: smoke_llama3_triage__<hash>__<ts>
Predictions: 10
parse_failure_rate: 1.0000 ← expected in dry-run (response is "[dry-run]")
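The 1.0000 figure follows directly from the placeholder response: none of the ten predictions can be parsed into a valid label. The sketch below illustrates the arithmetic under the assumption that parsing means matching the response against the list of valid labels. The label set and both helper functions are hypothetical, not PUMA's implementation.

```python
from typing import Optional


def parse_label(response: str, labels: list[str]) -> Optional[str]:
    """Return the matching label, or None when the response is not a valid label."""
    candidate = response.strip().lower()
    return candidate if candidate in labels else None


def parse_failure_rate(responses: list[str], labels: list[str]) -> float:
    """Fraction of responses that could not be parsed into a valid label."""
    if not responses:
        return 0.0
    failures = sum(parse_label(r, labels) is None for r in responses)
    return failures / len(responses)


labels = ["bug", "feature", "question"]  # hypothetical label set
dry_run_responses = ["[dry-run]"] * 10  # sample_size: 10, no Ollama call

print(f"parse_failure_rate: {parse_failure_rate(dry_run_responses, labels):.4f}")
# parse_failure_rate: 1.0000
```

In a live run the same rate should fall well below 1.0. A value that stays high suggests the model's output format does not match what the parser expects.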
Then run live:
Step 5: (Optional) Add custom prompt templates
PUMA's default templates work with most instruction-tuned models. If the new model requires a specific chat format (e.g. [INST]...[/INST] for Mistral, <|user|>...<|assistant|> for Phi), create a model-specific template file:
Available Jinja2 template variables:
| Variable | Description |
|---|---|
| {{ title }} | Issue title |
| {{ description }} | Issue description / body |
| {{ examples }} | List of few-shot example dicts (empty for zero-shot) |
| {{ labels }} | List of valid output labels |
Note. Model-specific template dispatch is not yet implemented: all models currently share the same templates. To specialise the prompt for one model, create a separate strategy or fork the template file.
Step 6: Compare with existing models
After running the smoke benchmark, compare results against a baseline:
Or open the dashboard to see the heatmap: