Baseline Inventory
Snapshot of repository state before Phase 0 restructuring. Produced by executing Step 1 of AGENT_INSTRUCTIONS.md.
1. Repository Overview
- Branch: main
- Commits: 1 (`af00114 PUMA`)
- Python runtime declared: 3.11 (Dockerfile)
- Package layout: flat; no `src/puma/` package yet, code lives in `src/` (scripts) and `agents/` (agent stubs)
- No `pyproject.toml`; dependencies managed via `requirements.txt` only
2. Directory Structure
puma/
├── agents/ # Agent stubs (LangGraph-style, not functional)
│ ├── __init__.py
│ ├── code_generator_agent.py
│ ├── estimation_agent.py
│ ├── orchestrator.py
│ ├── reviewer_agent.py
│ ├── tester_agent.py
│ └── triage_agent.py
├── assets/ # Static assets
├── data/ # Runtime datasets (gitignored partially)
│ ├── jira_balanced_200.csv — 200 balanced Jira issues (4 classes × 50)
│ ├── tawos_clean.csv — TAWOS cleaned (story points, project col)
│ └── tawos_raw.csv — TAWOS raw
├── db/ # TAWOS SQL dump
│ └── TAWOS.sql
├── reports/ # Benchmark reports (figures deleted)
│ └── summary_report.json
├── results/ # Evaluation outputs
│ ├── benchmark_history.csv
│ ├── estimation_cache.json
│ ├── evaluation_log.txt
│ ├── triage_cache.json
│ ├── triage_metrics.json — F1-macro=0.5087 (mistral:7b)
│ └── triage_metrics_mistral_7b.json
├── scripts/
│ ├── create_jira_data.py — Jira dataset builder (external URLs broken)
│ ├── download_datasets.py — Dataset downloader
│ └── run_all_models.sh — Multi-model benchmark runner
├── specs/ # Existing specs (minimal)
│ ├── architecture.md
│ ├── constitution.md
│ ├── estimation-agent.spec.md
│ ├── triage-agent.spec.md
│ └── prompts/
├── src/ # Core evaluation scripts (flat, not a package)
│ ├── cleanup.py
│ ├── data_prep.py
│ ├── evaluate_estimation.py
│ ├── evaluate_triage.py
│ ├── history.py
│ ├── rag_index.py
│ └── statistical_analysis.py
├── tests/
│ ├── __init__.py
│ └── test_core.py — 14 tests (integration + unit)
├── AGENT_INSTRUCTIONS.md
├── Dockerfile — python:3.11-slim, no GPU support
├── docker-compose.yml — services: ollama + evaluator
├── emissions.csv — CodeCarbon output
├── index.md — Architecture/scope document
├── pytest.ini
├── README.md
├── requirements.txt
└── start_puma.sh — Entry point (basic, no preflight/profile logic)
3. Existing Modules
src/evaluate_triage.py
- Function: Zero-shot triage classification via Ollama HTTP API (requests-based)
- Model: `qwen2.5:3b` (env `LLM_MODEL`)
- Classes: `TriageEvaluator` with `evaluate_issue()` and `evaluate_batch()`
- Parser: `parse_prediction()`, a regex match on `Critical|Major|Minor|Trivial`
- Metrics: F1-macro, confusion matrix via scikit-learn
- Cache: JSON flat-file (`results/triage_cache.json`)
- CodeCarbon: `@track_emissions` decorator
- Config: all via env vars (`TRIAGE_TEMPERATURE`, `TRIAGE_SEED`, etc.)
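To illustrate the parser entry, here is a minimal sketch of a regex-based label extractor. The inventory only records the function name and the four labels, so the signature, the case-insensitive matching, and the "first match wins" rule are assumptions rather than the behaviour of the existing `parse_prediction()`.

```python
import re
from typing import Optional

# The four priority classes named in the inventory.
LABELS = ("Critical", "Major", "Minor", "Trivial")

# Word boundaries avoid matching inside longer words; IGNORECASE is an assumption.
_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def parse_prediction(response_text: str) -> Optional[str]:
    """Return the first priority label found in the model output, or None."""
    match = _LABEL_RE.search(response_text)
    if match is None:
        return None
    # Normalise casing so "MAJOR" and "major" both map to "Major".
    return match.group(1).capitalize()
```

Returning `None` for unparseable output keeps the decision about how to score such responses (count as an error, drop, or retry) with the caller.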
src/evaluate_estimation.py
- Function: Few-shot story-point estimation via Ollama
- Model: `qwen2.5:3b` (env `LLM_MODEL`)
- Classes: `EstimationEvaluator` with `evaluate_item()` and `evaluate_batch()`
- Parser: `parse_story_points()`, float extraction with regex
- Metrics: MAE via scikit-learn
- Cache: JSON flat-file (`results/estimation_cache.json`)
- Fibonacci series: `[1, 2, 3, 5, 8, 13, 21]`
- Few-shot: 3 hardcoded examples
- Config: all via env vars
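The parser, Fibonacci series, and MAE entries can be sketched together as follows. This is an illustration under assumptions: the inventory does not say whether predictions are snapped to the Fibonacci series, so `snap_to_fibonacci()` is a hypothetical helper, and MAE is computed in plain Python here (the module itself uses scikit-learn) to keep the example self-contained.

```python
import re
from typing import Optional, Sequence

FIBONACCI = [1, 2, 3, 5, 8, 13, 21]  # series listed in the inventory
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_story_points(response_text: str) -> Optional[float]:
    """Extract the first integer or decimal number from the model output."""
    match = _NUMBER_RE.search(response_text)
    return float(match.group(0)) if match else None


def snap_to_fibonacci(value: float) -> int:
    """Hypothetical helper: nearest series value; ties go to the smaller one."""
    return min(FIBONACCI, key=lambda f: (abs(f - value), f))


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted story points."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Taking the first number in the response is a simplification; a response that restates the question ("out of 21 points, I estimate 5") would be misread, which is one reason to keep the parser separately testable.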
src/history.py
- Function: Persists benchmark runs to `results/benchmark_history.csv`
- Functions: `save_to_history()`, `get_ollama_model_info()`, `get_system_info()`
- HW detection: `platform`, `psutil` (CPU, RAM); no GPU detection
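As a rough picture of what persisting a run to CSV involves, the sketch below appends one row per run and writes the header only for a new file. The function names match the inventory, but the parameters, the column set, and the timestamp field are assumptions; the example also sticks to the standard library, so the RAM figures that the module obtains through `psutil` are omitted.

```python
import csv
import os
import platform
from datetime import datetime, timezone
from pathlib import Path


def get_system_info() -> dict:
    """Collect basic host facts with the standard library only (no GPU, no RAM)."""
    return {
        "os": platform.system(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }


def save_to_history(row: dict, path: str = "results/benchmark_history.csv") -> None:
    """Append one benchmark run to the CSV, creating file and header if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **get_system_info(),
        **row,
    }
    write_header = not target.exists() or target.stat().st_size == 0
    with target.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(record))
        if write_header:
            writer.writeheader()
        writer.writerow(record)
```

This only stays consistent if every call passes the same keys in `row`; a schema change between runs would misalign columns, which is one motivation for the SQLAlchemy-backed storage named in the target architecture.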
src/data_prep.py
- Function: Balances Jira dataset, cleans TAWOS, produces `data/*.csv`
- Key functions: `load_and_balance_jira()`, `load_and_clean_tawos()`
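The balancing step (4 classes × 50 issues in `jira_balanced_200.csv`) can be illustrated with a seeded downsampling routine. This is a sketch, not the logic of `load_and_balance_jira()`: the function name `balance_by_class`, the `priority` column name, and the seed are hypothetical, and it operates on plain dictionaries rather than the DataFrames the module presumably uses.

```python
import random
from collections import defaultdict
from typing import Dict, List


def balance_by_class(
    rows: List[Dict],
    label_key: str = "priority",  # hypothetical column name
    per_class: int = 50,
    seed: int = 42,
) -> List[Dict]:
    """Downsample every class to exactly per_class rows, reproducibly."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[label_key]].append(row)

    rng = random.Random(seed)  # local RNG so global random state is untouched
    balanced: List[Dict] = []
    for label in sorted(buckets):  # sorted for a deterministic class order
        group = buckets[label]
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        balanced.extend(rng.sample(group, per_class))

    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced
```

Raising on under-represented classes, rather than silently returning fewer rows, keeps the "balanced" guarantee explicit.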
src/statistical_analysis.py
- Function: Wilcoxon tests, effect sizes, confidence intervals on benchmark results
- Dependencies: `scipy.stats`, `sklearn.metrics`
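For the confidence-interval part, one common approach is a percentile bootstrap over paired score differences. The inventory does not state which interval method the module implements, so the sketch below is an assumption, written with the standard library only; `bootstrap_ci_mean_diff` and its parameters are hypothetical names.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci_mean_diff(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for paired samples."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")

    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)

    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )

    # Percentile method: cut alpha/2 from each tail of the bootstrap distribution.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), means[lo_idx], means[hi_idx]
```

An interval that excludes zero suggests a consistent difference between two models on the same items; it complements a paired Wilcoxon test and does not replace it.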
src/cleanup.py
- Function: Removes stale caches and temp files
src/rag_index.py
- Function: Stub/placeholder for RAG indexing (not functional)
agents/orchestrator.py
- Function: Thin orchestrator stub that delegates to `src/evaluate_*.py`
- Pattern: Class `Orchestrator` with `run_workflow()`, no real LangGraph usage
agents/{triage,estimation,reviewer,tester,code_generator}_agent.py
- Function: Stub agent classes with no real Ollama calls or logic
4. Dependencies (requirements.txt)
| Package | Version | Purpose |
|---|---|---|
| pandas | (unpinned) | DataFrames |
| scikit-learn | (unpinned) | Metrics |
| scipy | (unpinned) | Statistical tests |
| ollama | (unpinned) | Ollama Python client |
| codecarbon | (unpinned) | CO₂ tracking |
| matplotlib | (unpinned) | Plots |
| seaborn | (unpinned) | Plots |
| requests | (unpinned) | HTTP calls |
| pytest | (unpinned) | Tests |
Missing vs. target (from AGENT_INSTRUCTIONS.md §F0.2):
typer, httpx, pydantic, pyyaml, jinja2, numpy, sqlalchemy, psutil, streamlit, langdetect, pytest-cov, ruff, mypy, structlog, rich, pre-commit, alembic, respx
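A comparison like the one above can be reproduced mechanically by diffing the package names in `requirements.txt` against the target list. The helper names below are hypothetical, and the parsing is deliberately simple: it handles plain names, version specifiers, and comments, not every form that pip accepts.

```python
import re
from typing import Iterable, List, Set

_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def _normalise(name: str) -> str:
    """Lower-case and treat '_' like '-' so name variants compare equal."""
    return name.lower().replace("_", "-")


def parse_requirement_names(text: str) -> Set[str]:
    """Extract package names from requirements.txt-style text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _NAME_RE.match(line)
        if match:
            names.add(_normalise(match.group(0)))
    return names


def missing_packages(requirements_text: str, target: Iterable[str]) -> List[str]:
    """Return target packages that are absent from the requirements text."""
    present = parse_requirement_names(requirements_text)
    return sorted({_normalise(t) for t in target} - present)
```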
5. Tests (tests/test_core.py)
14 test cases across 7 classes:
| Class | Count | Type | Status |
|---|---|---|---|
| `TestDataFiles` | 5 | Integration (needs CSV files) | Pass (data present) |
| `TestTriageEvaluator` | 4 | Unit | Pass |
| `TestEstimationEvaluator` | 5 | Unit | Pass |
| `TestStatisticalAnalysis` | 4 | Unit | Pass |
| `TestCodeCarbon` | 1 | Import | Pass |
| `TestOllamaClient` | 1 | Import | Pass |
| `TestEndToEnd` | 2 | E2E (needs Ollama) | Skip without Ollama |
No tests/unit/, tests/integration/, tests/smoke/ subdirectories yet.
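The "Skip without Ollama" status implies some form of reachability check before the E2E tests run. How `test_core.py` does this is not recorded here, so the following is a sketch under assumptions: it probes Ollama's default local address (`http://localhost:11434`) through the `/api/tags` endpoint, and `ollama_available` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def ollama_available(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> bool:
    """Return True if an Ollama server answers with HTTP 200 at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False
```

With pytest, the result could feed a marker such as `pytest.mark.skipif(not ollama_available(), reason="Ollama not running")`, so that E2E tests are reported as skipped rather than failed on machines without a local server.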
6. Docker / Infrastructure
docker-compose.yml
- Services: `ollama` (`ollama/ollama:latest`), `evaluator` (custom Dockerfile)
- No GPU support in compose file
- No dashboard service, no Grafana service
- Config via env vars in compose
Dockerfile
- Base: `python:3.11-slim`
- Installs requirements, copies repo
- No entrypoint (CMD: `tail -f /dev/null`)
start_puma.sh
- Starts Docker Compose
- Pulls Ollama model
- Runs `src/evaluate_triage.py` and `src/evaluate_estimation.py`
- No HW detection, no profile selection, no preflight
7. Known Results (MVP Baseline)
| Task | Model | Metric | Value |
|---|---|---|---|
| Triage | qwen2.5:3b | F1-macro | ~0.5867 (reported) / 0.5087 (mistral:7b in saved JSON) |
| Estimation | qwen2.5:3b | MAE | 1.86 SP |
8. What Is Missing vs. Target Architecture
| Component | Target (INDEX.md) | Current State |
|---|---|---|
| `src/puma/` package | Full modular package | Does not exist |
| `pyproject.toml` | Required | Does not exist |
| `preflight/` | HW detection + profiles | Not implemented |
| `runtime/` | OllamaClient (httpx, logprobs) | Uses requests + ollama SDK ad-hoc |
| `datasets/` | Jira SR + TAWOS downloaders | Partial (`scripts/download_datasets.py`) |
| `scenarios/` | Declarative YAML scenarios | Stubs in `specs/` |
| `adaptation/` | 9 prompting strategies | Zero-shot (triage) and hardcoded few-shot (estimation) only |
| `perturbations/` | 5 perturbation types | Not implemented |
| `metrics/` | 7 metric families | Only accuracy (F1, MAE) |
| `sustainability/` | CodeCarbon wrapper | Basic `@track_emissions` |
| `storage/` | SQLAlchemy + SQLite | JSON flat-file caches |
| `dashboard/` | Streamlit app | Not implemented |
| `cli.py` | Typer CLI | Not implemented |
| `config/profiles.yaml` | 5 HW profiles | Not implemented |
| `config/models_catalog.yaml` | Full model catalog | Not implemented |
| `specs/runs/` | Declarative run-specs | Not implemented |
| `specs/prompts/` | Jinja prompt templates | Partial (1 file) |
| `tests/unit/` | ≥80% coverage on core modules | Not structured |
| `tests/integration/` | Ollama integration tests | Not structured |
| `tests/smoke/` | E2E smoke tests | Not structured |
| `Makefile` | `make lint/test/smoke` | Not implemented |
| ruff / mypy | Linting and type checking | Not configured |
| Logging | structlog JSON lines | Basic logging module |
9. Files to Preserve (Conservative Refactor)
The following files contain working logic that must be migrated (not deleted) during Phase 0:
- `src/evaluate_triage.py` → migrate to `src/puma/scenarios/triage_jira.py` + `src/puma/runtime/`
- `src/evaluate_estimation.py` → migrate to `src/puma/scenarios/estimation_tawos.py`
- `src/history.py` → migrate to `src/puma/storage/` + `src/puma/preflight/`
- `src/data_prep.py` → migrate to `src/puma/datasets/`
- `src/statistical_analysis.py` → migrate to `src/puma/metrics/`
- `tests/test_core.py` → split into `tests/unit/` and `tests/integration/`
- `data/jira_balanced_200.csv` → keep in `data/` during transition
- `data/tawos_clean.csv` → keep in `data/` during transition
- `db/TAWOS.sql` → source for `src/puma/datasets/tawos.py`
10. Open Questions (Phase 0)
See docs/open_questions.md for decisions logged during implementation.