PUMA — Index

PUMA Understanding & Management w Agents

Local, reproducible, multi-dimensional benchmarking of open LLMs on project management tasks.

1. What is PUMA?

PUMA evaluates open large language models (served locally via Ollama) on three PMO tasks: issue triage, story-point estimation, and backlog prioritization. It produces a multi-dimensional performance profile per model covering accuracy, calibration, robustness, fairness, inference efficiency, and carbon footprint — all running 100% locally, GDPR-compliant by design.

Why PUMA?

Existing LLM benchmarks measure accuracy in cloud environments. PUMA addresses a different question: which open model should a project management office deploy on-premise, given its hardware, risk tolerance, and sustainability constraints?

Key differentiators:

Property	PUMA
Inference	100% local via Ollama — no external API calls
Reproducibility	Spec-driven: every run fully described by a declarative YAML
Metrics	Accuracy + calibration + robustness + fairness + efficiency + CO₂
Scenarios	Triage (Jira), Estimation (TAWOS), Prioritization (pairwise)
Hardware	Runs on consumer laptops (8 GB RAM) up to GPU workstations
Privacy	No data leaves the machine

2. System Architecture

┌─────────────────────────────────────────────────────────┐
│  Host machine (Docker + docker compose)                  │
│                                                          │
│  ┌──────────────┐   ┌──────────────┐  ┌──────────────┐  │
│  │ puma_runner  │   │ puma_ollama  │  │puma_dashboard│  │
│  │              │──▶│  :11434      │  │   :8501      │  │
│  │  puma CLI    │   │  LLM server  │  │  Streamlit   │  │
│  └──────┬───────┘   └──────────────┘  └──────────────┘  │
│         │                                                 │
│         ▼                                                 │
│  ┌──────────────────────────────┐                        │
│  │  puma_data volume            │                        │
│  │  data/puma.db  (SQLite)      │                        │
│  │  data/cache/inferences.db    │                        │
│  │  results/<run_id>/           │                        │
│  └──────────────────────────────┘                        │
└─────────────────────────────────────────────────────────┘

Full data-flow and package map → architecture.md

3. Benchmark Scenarios

3.1 Issue Triage (`triage_jira`)

Task: Given a Jira issue title and description, assign a priority label.
Labels: Critical, Major, Minor, Trivial
Dataset: 200 balanced Jira issues (50 per class)
Primary metric: F1 macro
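As an illustration of the primary metric, the following is a minimal sketch of macro-averaged F1 over the four triage labels. The function name `f1_macro` and the convention that an unparseable prediction is passed as `None` (and therefore never matches a label) are assumptions made for this example, not PUMA's actual implementation.

```python
TRIAGE_LABELS = ("Critical", "Major", "Minor", "Trivial")

def f1_macro(gold, pred, labels=TRIAGE_LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no gold items and no predictions contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

gold = ["Critical", "Major", "Minor", "Trivial"]
pred = ["Critical", "Major", "Minor", "Minor"]
print(round(f1_macro(gold, pred), 4))
```

Because every class carries equal weight, a model that never predicts Trivial is penalized as heavily as one that never predicts Critical, which suits the balanced 50-per-class dataset.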

3.2 Story-Point Estimation (`estimation_tawos`)

Task: Given a user story title and description, predict story points.
Output: Fibonacci number from {1, 2, 3, 5, 8, 13, 21, 34, 55, 89}
Dataset: TAWOS — 9 020 agile backlog items
Primary metric: MAE
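Since the output space is restricted to a fixed set of values, a free-text model response has to be mapped onto that set. The sketch below shows one possible approach: take the first integer in the response and snap it to the nearest allowed value. The function `parse_story_points` and its tie-breaking rule are hypothetical and are not taken from PUMA's parser.

```python
import re

FIBONACCI_POINTS = (1, 2, 3, 5, 8, 13, 21, 34, 55, 89)

def parse_story_points(response):
    """Extract the first integer from a response and snap it to the allowed set."""
    match = re.search(r"\d+", response)
    if match is None:
        return None  # the caller can count this as a parse failure
    value = int(match.group())
    # min() keeps the first candidate on ties, so 4 maps to 3 rather than 5.
    return min(FIBONACCI_POINTS, key=lambda allowed: abs(allowed - value))

print(parse_story_points("I estimate 6 story points."))  # 5
```

Returning `None` instead of a default value keeps unparseable responses visible in the parse failure rate rather than hiding them inside the MAE.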

3.3 Backlog Prioritization (`prioritization_jira`)

Task: Given two Jira issues A and B, decide which has higher priority.
Output: A or B
Dataset: Pairwise samples from Jira SR dataset
Primary metric: Accuracy

Full scenario specs → scenarios_reference.md

4. Prompting Strategies

PUMA implements nine prompting strategies covering the major paradigms in the literature. Few-shot is available at k = 3, 5, and 8, so the table lists eleven strategy IDs:

Strategy	ID	Description
Zero-shot	`zero-shot`	Direct instruction, no examples
Zero-shot CoT	`zero-shot-cot`	Add "think step by step"
One-shot	`one-shot`	Single in-context example
Few-shot k=3	`few-shot-3`	Three stratified examples
Few-shot k=5	`few-shot-5`	Five stratified examples
Few-shot k=8	`few-shot-8`	Eight stratified examples
CoT few-shot	`cot-few-shot`	Few-shot with CoT rationales
RCOIF	`rcoif`	Role + Context + Output + Instruction + Format
Contextual Anchoring	`contextual-anchoring`	Grounds prediction in project context
Self-Consistency	`self-consistency`	Majority vote over n samples (requires temperature > 0)
EGI	`egi`	Example-Guided Inference

Templates live in specs/prompts/<scenario>/<strategy>.jinja.
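The self-consistency strategy listed in the table aggregates several sampled responses by majority vote. A minimal sketch of that aggregation step follows; the function name `majority_vote`, the use of `None` for unparseable samples, and the first-seen tie-break are assumptions for illustration only.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent parsed label; ties go to the label seen first."""
    votes = [s for s in samples if s is not None]  # drop unparseable samples
    if not votes:
        return None
    counts = Counter(votes)
    best = max(counts.values())
    # Walk the samples in order so that ties are broken deterministically.
    for label in votes:
        if counts[label] == best:
            return label

print(majority_vote(["Major", "Minor", "Major", None, "Critical"]))  # Major
```

With temperature 0 all n samples would be identical and the vote would add cost without adding information, which is why the strategy requires temperature > 0.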

5. Metrics

5.1 Accuracy

F1 macro / weighted — classification tasks
Accuracy — overall correct fraction
MAE / MDAE / RMSE — regression tasks (story points)
MAE by SP bin — 1–3 / 5–8 / 13–21 / 34+
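The regression metrics above can be sketched in a few lines. This example assumes that items are binned by their gold story-point value and that the bin edges match the four ranges listed; the helper names are hypothetical.

```python
from statistics import mean, median

# (bin name, lowest gold value, highest gold value)
SP_BINS = (("1-3", 1, 3), ("5-8", 5, 8), ("13-21", 13, 21), ("34+", 34, float("inf")))

def regression_errors(gold, pred):
    """MAE, median absolute error (MDAE), and RMSE over paired values."""
    abs_err = [abs(g - p) for g, p in zip(gold, pred)]
    return {
        "mae": mean(abs_err),
        "mdae": median(abs_err),
        "rmse": mean([e * e for e in abs_err]) ** 0.5,
    }

def mae_by_bin(gold, pred):
    """MAE per story-point bin, keyed by bin name; empty bins are omitted."""
    result = {}
    for name, low, high in SP_BINS:
        errs = [abs(g - p) for g, p in zip(gold, pred) if low <= g <= high]
        if errs:
            result[name] = mean(errs)
    return result

gold = [1, 3, 5, 8, 13, 34]
pred = [2, 3, 8, 8, 21, 21]
print(regression_errors(gold, pred))
print(mae_by_bin(gold, pred))
```

Reporting MAE per bin makes it visible when a model is accurate on small stories but systematically underestimates large ones, which a single aggregate MAE can hide.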

5.2 Calibration

ECE (Expected Calibration Error) — equal-width bins
MCE (Maximum Calibration Error)
Brier Score
Reliability diagrams (PNG export)
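A minimal sketch of the three calibration scores, assuming each prediction carries a confidence in [0, 1] and a 0/1 correctness flag. The function name, the default of 10 bins, and the binary (confidence versus correctness) form of the Brier score are assumptions for this example; the exact definitions are in metrics_reference.md.

```python
def calibration_report(confidences, correct, n_bins=10):
    """ECE and MCE over equal-width confidence bins, plus a binary Brier score."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / n * gap  # gap weighted by bin population
        mce = max(mce, gap)            # worst single bin

    brier = sum((c - ok) ** 2 for c, ok in zip(confidences, correct)) / n
    return {"ece": ece, "mce": mce, "brier": brier}

report = calibration_report([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
print(report)
```

In this example the high-confidence bin is nearly perfectly calibrated while the 0.65 bin is only 50% accurate, so MCE (0.15) is larger than ECE (0.10).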

5.3 Robustness

Robustness score — max(0, 1 − |metric_orig − metric_perturbed|)
Consistency rate — fraction of predictions unchanged under perturbation

Text perturbations: typos_5pct, case_upper, case_lower, truncate_50pct, tech_noise
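Both robustness measures follow directly from the definitions above. The sketch below uses hypothetical function names and assumes predictions on the original and perturbed inputs are aligned by position.

```python
def robustness_score(metric_orig, metric_perturbed):
    """max(0, 1 - |metric_orig - metric_perturbed|), as defined above."""
    return max(0.0, 1.0 - abs(metric_orig - metric_perturbed))

def consistency_rate(preds_orig, preds_perturbed):
    """Fraction of predictions that stay the same under perturbation."""
    if len(preds_orig) != len(preds_perturbed):
        raise ValueError("prediction lists must be aligned and equally long")
    if not preds_orig:
        return 0.0
    same = sum(a == b for a, b in zip(preds_orig, preds_perturbed))
    return same / len(preds_orig)

print(robustness_score(0.80, 0.65))
print(consistency_rate(["A", "B", "A", "B"], ["A", "B", "B", "B"]))  # 0.75
```

Note that the score is most informative for metrics bounded in [0, 1] such as F1 or accuracy; for an unbounded metric such as MAE, any change of 1 or more clamps the score to 0.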

5.4 Fairness

Per-group accuracy — broken down by any categorical attribute
Fairness gap — max − min accuracy across groups
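As a sketch of these two definitions, the example below groups predictions by a categorical attribute and reports the spread between the best and worst group. The function names and the sample attribute values are hypothetical.

```python
from collections import defaultdict

def per_group_accuracy(groups, gold, pred):
    """Accuracy for each value of a categorical attribute."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, g, p in zip(groups, gold, pred):
        totals[group] += 1
        hits[group] += int(g == p)
    return {group: hits[group] / totals[group] for group in totals}

def fairness_gap(accuracy_by_group):
    """Max minus min accuracy across groups; 0.0 means identical accuracy."""
    values = list(accuracy_by_group.values())
    return max(values) - min(values)

groups = ["web", "web", "mobile", "mobile"]
gold = ["A", "B", "A", "B"]
pred = ["A", "B", "A", "A"]
by_group = per_group_accuracy(groups, gold, pred)
print(by_group, fairness_gap(by_group))
```

Groups with very few items can dominate the gap, so it is worth reporting group sizes alongside it.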

5.5 Efficiency

Latency percentiles — p50, p95, p99, mean (ms)
Throughput — instances per minute
Parse failure rate — fraction of responses that could not be parsed
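The efficiency metrics can be derived from per-prediction latencies and parsed labels. This sketch assumes the nearest-rank percentile method and treats `None` as an unparseable response; PUMA may use a different percentile interpolation, and all names here are hypothetical.

```python
import math
from statistics import mean

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def throughput_per_minute(n_instances, wall_seconds):
    """Instances processed per minute of wall-clock time."""
    return n_instances / (wall_seconds / 60)

def parse_failure_rate(parsed_labels):
    """Fraction of responses whose parsed label is None."""
    return sum(label is None for label in parsed_labels) / len(parsed_labels)

latencies_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
summary = {
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "mean": mean(latencies_ms),
}
print(summary)
```

With only ten samples p95 and p99 both resolve to the slowest request, which is a reminder that tail percentiles need a reasonably large run to be meaningful.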

5.6 Sustainability

CO₂ equivalent (g / kg) via CodeCarbon (offline, process-level)
Energy consumed (kWh)
gCO₂ per F1 point — quality-adjusted cost
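The quality-adjusted cost divides emissions by achieved quality. The sketch below assumes that one "F1 point" means one percentage point of F1 macro (so an F1 of 0.60 is 60 points) and that emissions are recorded in kilograms; both are assumptions for illustration, and metrics_reference.md holds the authoritative formula.

```python
def g_co2_per_f1_point(co2_kg, f1_macro):
    """Grams of CO2-equivalent per percentage point of F1 macro (assumed scaling)."""
    grams = co2_kg * 1000.0
    points = f1_macro * 100.0
    if points == 0:
        return float("inf")  # emissions were spent without any measurable quality
    return grams / points

# 12 g of CO2-eq for a run that reached F1 macro = 0.60
print(g_co2_per_f1_point(0.012, 0.60))
```

A lower value is better: it lets a small model with slightly lower F1 be compared fairly against a larger model that consumes far more energy.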

Full formulas → metrics_reference.md

6. Model Catalog

Models are registered in config/models_catalog.yaml. Current catalog:

Model	Params	Size	CPU-lite	CPU-standard	GPU-entry
qwen2.5:0.5b	0.5B	~0.4 GB	✓	✓	✓
qwen2.5:1.5b	1.5B	~1.0 GB	✓	✓	✓
qwen2.5:3b	3B	~2.0 GB	—	✓	✓
qwen2.5:7b	7B	~4.7 GB	—	✓	✓
llama3.2:3b	3B	~2.0 GB	—	✓	✓
mistral:7b	7B	~4.1 GB	—	✓	✓
deepseek-r1:7b	7B	~4.7 GB	—	—	✓

Add a new model → adding_models.md

7. Hardware Profiles

Profile	RAM	VRAM	Recommended models
`cpu-lite`	8 GB	—	qwen2.5:0.5b, qwen2.5:1.5b
`cpu-standard`	16 GB	—	qwen2.5:3b, qwen2.5:7b, llama3.2:3b
`gpu-entry`	16 GB	4 GB	qwen2.5:7b, mistral:7b
`gpu-mid`	32 GB	8 GB	qwen2.5:14b
`gpu-high`	64 GB	16 GB+	qwen2.5:32b, deepseek-r1:14b

The profile is auto-detected by `puma preflight`. Override it with `--profile <name>`.
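To make the table concrete, the following sketch picks the most capable profile whose RAM and VRAM minimums are met. It treats the table values as lower bounds and is only an illustration; the function `select_profile` is hypothetical and does not reproduce the actual detection logic of `puma preflight`.

```python
# (profile name, minimum RAM in GB, minimum VRAM in GB), most capable first
PROFILES = (
    ("gpu-high", 64, 16),
    ("gpu-mid", 32, 8),
    ("gpu-entry", 16, 4),
    ("cpu-standard", 16, 0),
    ("cpu-lite", 8, 0),
)

def select_profile(ram_gb, vram_gb=0):
    """Return the first profile whose requirements fit, or None if none fits."""
    for name, min_ram, min_vram in PROFILES:
        if ram_gb >= min_ram and vram_gb >= min_vram:
            return name
    return None

print(select_profile(16))         # cpu-standard
print(select_profile(32, 8))      # gpu-mid
```

Ordering the tuple from most to least capable means a machine with plenty of RAM but no GPU falls through to `cpu-standard` instead of matching a GPU profile.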

8. Storage Schema

All results are persisted to data/puma.db (SQLite, read by dashboard and report generator):

Table	Contents
`runs`	Run ID, spec hash, profile, status, timestamps
`instances`	Canonical dataset items (instance_id, gold label, input text)
`predictions`	Per-prediction rows: model, strategy, raw response, parsed label, latency, tokens
`metrics`	Flat metric name → value per run (for pivot and comparison)
`emissions`	CodeCarbon output: kWh, CO₂ kg, duration
`profile_snapshots`	Hardware snapshot at run time: CPU, RAM, GPU, Ollama version
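The flat layout of the `metrics` table makes it easy to pivot metric names into columns for side-by-side comparison. The sketch below builds a throwaway in-memory database to show the idea; the column names (`run_id`, `name`, `value`), run IDs, and numbers are all hypothetical and are not taken from the real schema in data/puma.db.

```python
import sqlite3

# In-memory stand-in for the metrics table, with assumed column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (run_id TEXT, name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [
        ("run_a", "f1_macro", 0.61),
        ("run_a", "latency_p95_ms", 840.0),
        ("run_b", "f1_macro", 0.58),
        ("run_b", "latency_p95_ms", 410.0),
    ],
)

def pivot_metrics(connection):
    """Turn flat (run_id, name, value) rows into {run_id: {name: value}}."""
    table = {}
    query = "SELECT run_id, name, value FROM metrics ORDER BY run_id, name"
    for run_id, name, value in connection.execute(query):
        table.setdefault(run_id, {})[name] = value
    return table

pivoted = pivot_metrics(conn)
for run_id, row in pivoted.items():
    print(run_id, row)
```

Storing one row per metric means a new metric can be added without a schema migration, at the cost of doing the pivot at read time.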

9. Project Structure

puma/
├── src/puma/               # Main package (PYTHONPATH=/app/src)
│   ├── preflight/          # Hardware detection and profile selection
│   ├── runtime/            # OllamaClient, InferenceCache
│   ├── datasets/           # Dataset loaders and verification
│   ├── scenarios/          # Benchmark task definitions
│   ├── adaptation/         # Prompting strategies and example selection
│   ├── perturbations/      # Text perturbation functions
│   ├── metrics/            # All metric computations
│   ├── sustainability/     # CodeCarbon wrapper
│   ├── orchestrator/       # RunSpec, Runner, compare_runs
│   ├── storage/            # SQLAlchemy ORM (6 tables)
│   ├── dashboard/          # Streamlit app (9 views)
│   ├── reporting/          # Markdown + PDF report generation
│   └── cli.py              # Unified CLI entrypoint
├── tests/
│   ├── unit/               # 206 fast tests, no external deps
│   ├── integration/        # Require data files
│   └── smoke/              # AppTest + end-to-end dry-run
├── specs/
│   ├── prompts/            # Jinja2 templates per scenario × strategy
│   ├── runs/               # Example and gate run-specs
│   └── scenarios/          # Scenario YAML specs
├── docs/                   # Extended documentation
├── config/                 # models_catalog.yaml, runtime_profile.yaml
├── data/                   # Datasets and SQLite DB (gitignored)
├── results/                # Run artifacts: runspec.yaml, metrics.json, report.md
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── start_puma.sh
└── pyproject.toml

10. Success Criteria

A user cloning the repo on a machine with 16 GB RAM and Docker can:

Run ./start_puma.sh with no additional configuration.
Wait less than 20 minutes for provisioning (model download + dataset verification).
Run puma run specs/runs/smoke_triage.yaml and see progress in real time.
Open http://localhost:8501 and explore results in the dashboard.
Generate a report with puma report <run_id>.
Compare models with puma compare <run_id_1> <run_id_2>.

All of the above: 100% local, fully traceable, with carbon emissions recorded.

11. Links

Resource	Path
User guide	user_guide.md
Architecture	architecture.md
Metrics reference	metrics_reference.md
Scenarios reference	scenarios_reference.md
Adding models	adding_models.md
Adding scenarios	adding_scenarios.md
Troubleshooting	troubleshooting.md
Contributing	CONTRIBUTING.md (repo root)
Changelog	CHANGELOG.md (repo root)