PUMA tool selection follows a hierarchy: reproducibility first, then capability, then convenience

Every tool in PUMA’s stack was chosen through a consistent decision hierarchy. Understanding this hierarchy explains why specific tools were selected over seemingly more capable alternatives.


The Decision Hierarchy

1. Reproducibility (non-negotiable; Constitution Article 1)
   → Does this tool produce deterministic output given fixed inputs?
   → Can any researcher reproduce results without paying for proprietary access?
   → Are versions pinnable (no silent updates that change behaviour)?

2. Capability (sufficient, not maximal)
   → Can this tool meet the minimum threshold for the task? (F1 ≥ 0.55, MAE ≤ 3.0 SP)
   → Is it capable enough, not necessarily most capable?

3. Convenience (last consideration)
   → After (1) and (2) are satisfied, prefer simpler and faster tools
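
The three levels can be read as two hard gates followed by a ranking. The following is a minimal sketch of that reading, not PUMA's actual selection code: the `Candidate` fields, the integer `convenience` score, and the four example tools with their numbers are all hypothetical and exist only to illustrate the ordering. The two thresholds are the ones quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted in the hierarchy above.
MIN_F1 = 0.55
MAX_MAE_SP = 3.0


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of a tool under evaluation."""
    name: str
    deterministic: bool       # fixed inputs give the same output
    free_to_reproduce: bool   # no paid proprietary access required
    version_pinnable: bool    # no silent updates
    f1: float
    mae_sp: float
    convenience: int          # assumed score: higher means simpler/faster


def is_reproducible(c: Candidate) -> bool:
    # Level 1: all three questions must be answered "yes".
    return c.deterministic and c.free_to_reproduce and c.version_pinnable


def is_capable(c: Candidate) -> bool:
    # Level 2: "sufficient" means meeting the thresholds, nothing more.
    return c.f1 >= MIN_F1 and c.mae_sp <= MAX_MAE_SP


def select_tool(candidates: List[Candidate]) -> Optional[Candidate]:
    # Levels 1 and 2 filter; level 3 only ranks what survives.
    eligible = [c for c in candidates if is_reproducible(c) and is_capable(c)]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.convenience)


# Illustrative candidates with made-up numbers.
candidates = [
    Candidate("tool_a", False, False, False, f1=0.80, mae_sp=1.5, convenience=9),
    Candidate("tool_b", True, True, True, f1=0.50, mae_sp=3.5, convenience=7),
    Candidate("tool_c", True, True, True, f1=0.60, mae_sp=2.8, convenience=5),
    Candidate("tool_d", True, True, True, f1=0.58, mae_sp=2.9, convenience=8),
]
chosen = select_tool(candidates)
```

Here `tool_a` is rejected despite the best scores because it fails the reproducibility gate, `tool_b` fails the capability thresholds, and `tool_d` is preferred over the slightly more accurate `tool_c` because convenience is the only criterion left once both gates pass.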


How This Drives Specific Choices

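Local Ollama over cloud APIs: GPT-4 has a higher capability ceiling, but cloud API results can change when the provider silently updates the model. Ollama with a pinned model digest fixes the exact weights, so a run can be repeated locally with the same model, seed, and decoding settings. Reproducibility wins.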
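
One way to enforce the pinned digest is to check it before every generation call. The sketch below assumes Ollama's local REST API at its default address, using `/api/tags` to list installed models with their digests and `/api/generate` with `seed` and `temperature` options. The function names and the idea of refusing to run on a digest mismatch are illustrative assumptions, not PUMA's documented behaviour. Identical output across different hardware is not guaranteed by this alone.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def installed_digest(tags_payload: dict, model: str) -> Optional[str]:
    """Return the digest listed for `model` in an /api/tags response, or None."""
    for entry in tags_payload.get("models", []):
        if entry.get("name") == model:
            return entry.get("digest")
    return None


def build_generate_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Request body for /api/generate with fixed decoding settings."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"seed": seed, "temperature": 0},
    }


def generate_pinned(model: str, expected_digest: str, prompt: str) -> str:
    """Generate only if the installed model matches the pinned digest."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
        tags = json.load(resp)
    actual = installed_digest(tags, model)
    if actual != expected_digest:
        raise RuntimeError(
            f"{model}: expected digest {expected_digest}, found {actual}"
        )
    body = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Recording the expected digest alongside the experiment configuration means a changed model fails loudly instead of silently producing different results.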

Mistral 7B over Llama 3.2 70B: The 70B model has higher capability, but it needs more memory than the 16 GB RAM hardware budget allows. The 7B model is capable enough for the task and fits within that constraint. Sufficient capability wins over maximal capability.
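
The hardware argument can be checked with rough arithmetic. The sketch below estimates memory for the model weights alone (parameters × bits per weight ÷ 8) and ignores the KV cache and runtime overhead, so real requirements are higher. The 4-bit quantization level is an assumption chosen for illustration; the source does not state which quantization PUMA uses.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_ram(params_billions: float, bits_per_weight: int,
                ram_gb: float = 16.0) -> bool:
    """True if the weights alone fit in the given RAM budget."""
    return weight_memory_gb(params_billions, bits_per_weight) <= ram_gb


# Assumed 4-bit quantization for both models.
mem_70b = weight_memory_gb(70, 4)   # 35.0 GB, over the 16 GB budget
mem_7b = weight_memory_gb(7, 4)     # 3.5 GB, well within it
```

Even under aggressive quantization the 70B weights exceed 16 GB by more than a factor of two, while the 7B model leaves room for context and the rest of the stack.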

ChromaDB → Qdrant migration: ChromaDB is simpler (convenience), but Qdrant supports the metadata filtering needed for Stage 4 production (capability). Capability wins at Stage 4.

Poetry over requirements.txt: Both can pin dependencies, but Poetry resolves transitive dependency conflicts and records the resolved versions in a lock file (reproducibility over the convenience of a simpler tool).

Pydantic AI for schema validation: The convenient option would be ad hoc string parsing. Pydantic instead validates that outputs conform to the schema even when the LLM produces slight format variations (reliability/reproducibility).
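
To show what tolerating slight format variations can look like, here is a minimal sketch using plain `pydantic` (v2) models rather than Pydantic AI itself. The `Estimate` schema, its field names, and the `parse_estimate` helper are hypothetical; the source does not give PUMA's output schema.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class Estimate(BaseModel):
    """Hypothetical schema for one LLM estimation result."""
    # Strip stray whitespace the model may add around string values.
    model_config = ConfigDict(str_strip_whitespace=True)

    story_points: float = Field(ge=0)
    rationale: str


def parse_estimate(raw: str) -> Optional[Estimate]:
    """Validate raw LLM output; return None if it cannot match the schema."""
    try:
        return Estimate.model_validate_json(raw)
    except ValidationError:
        return None


# A numeric value quoted as a string is coerced to the declared type.
ok = parse_estimate('{"story_points": "5", "rationale": " small change "}')

# Output that cannot be coerced is rejected rather than passed downstream.
bad = parse_estimate('{"story_points": "lots"}')
```

The same input always yields either the same typed object or a rejection, which is the property that ad hoc string parsing tends to lose.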


The Fallback Stack

For each primary tool, a fallback exists that maintains the same reproducibility principle:

| Primary | Fallback | Reason |
|---|---|---|
| Llama 3.2 8B | Phi-3.5 Mini 3.8B | Latency risk: <30s guaranteed |
| Ollama | LM Studio | Installation failure risk |
| Qdrant | ChromaDB | Setup complexity risk |
| LangGraph | AutoGen | Cyclic graph complexity risk |
| FastAPI | Flask | Framework overhead risk |
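
The fallback stack can be captured as a simple lookup. The sketch below is an assumption about how such a mapping might be encoded, not PUMA's actual configuration. The tool names come from the table above, while `FALLBACKS`, `choose`, and the `is_available` callback are hypothetical.

```python
from typing import Callable, Dict

# Primary -> fallback pairs from the table above.
FALLBACKS: Dict[str, str] = {
    "Llama 3.2 8B": "Phi-3.5 Mini 3.8B",
    "Ollama": "LM Studio",
    "Qdrant": "ChromaDB",
    "LangGraph": "AutoGen",
    "FastAPI": "Flask",
}


def choose(primary: str, is_available: Callable[[str], bool]) -> str:
    """Return the primary tool if it is usable, otherwise its fallback.

    `is_available` is a caller-supplied check (for example an install or
    latency probe); what it tests is outside the scope of this sketch.
    """
    if is_available(primary):
        return primary
    return FALLBACKS[primary]


# Example: pretend only Qdrant failed its setup check.
selected = {p: choose(p, lambda tool: tool != "Qdrant") for p in FALLBACKS}
```

Keeping the pairs in one place makes it easy to confirm that every fallback is itself a local, version-pinnable tool, so switching never weakens the reproducibility guarantee.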
