PUMA Architecture Specification v1.0

Spec-First Document

Constitutional Preamble

Overview

This specification is written BEFORE implementation begins. Every change to it requires a version increment.


System Overview

┌────────────────────────────────────────────────────────────────┐
│                         PUMA BENCHMARK                         │
├──────────┬─────────────────────────┬──────────┬────────────────┤
│  INPUT   │     AGENT SWARM         │GOVERNANCE│   OUTPUTS      │
│  DATA    │   (LOCAL LLMs)          │& CONTROL │ & ARTEFACTS    │
├──────────┼─────────────────────────┼──────────┼────────────────┤
│          │                         │          │                │
│ Jira SR  │ ┌──────────────────┐    │ Ollama   │ Performance    │
│ (CSV)    │ │  Triage Agent    │    │ Orchestr.│ Report         │
│          │ │  (Stage 1)       │    │          │                │
│ TAWOS    │ └─────────┬────────┘    │ JSON     │ Reproducible   │
│ (CSV)    │           │             │ Schema & │ Metrics        │
│          │ ┌─────────▼────────┐    │ Ethics   │ (F1, MAE)      │
│          │ │ Estimation Agent │    │ Guardr.  │                │
│          │ │  (Stage 2)       │    │          │ Open Source    │
│          │ └──────────────────┘    │ Human-   │ Code           │
│          │                         │ in-Loop  │ (MIT Licence)  │
│          │ Models:                 │ Validat. │                │
│          │ • Llama 3.2 8B          │          │ Academic Paper │
│          │ • Mistral 7B            │ CodeCarb.│ (PUMA Repo)    │
│          │ • Phi-3.5 Mini*         │ Tracking │                │
└──────────┴─────────────────────────┴──────────┴────────────────┘
* Fallback model if latency > 60s

Component Specifications

Component 1: Data Loader

component: DataLoader
purpose: Load, validate, and prepare evaluation subsets from raw datasets
inputs:
  - name: dataset_path
    type: str
    description: Path to raw CSV file
  - name: dataset_type
    type: str
    values: [jira-sr, tawos]
  - name: n_per_class
    type: int
    description: Number of samples per class for stratified subset
    default: 50
  - name: seed
    type: int
    default: 42
outputs:
  - name: evaluation_subset
    type: pd.DataFrame
    description: Balanced, validated evaluation dataset
  - name: metadata
    type: dict
    description: Dataset statistics and composition report
constraints:
  - "Must produce identical output for same seed across Python versions"
  - "Must validate all required columns are present"
  - "Must document class distribution in output metadata"
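
The DataLoader contract above can be illustrated with a minimal sketch. It assumes the CSV has already been read (for example with `pd.read_csv(dataset_path)`). The column names in `REQUIRED_COLUMNS` and `LABEL_COLUMN` are hypothetical placeholders, since this specification does not list the actual Jira SR or TAWOS schemas. Raising an error for classes smaller than `n_per_class` is one possible way to keep the subset balanced, not a stated requirement.

```python
import pandas as pd

# Hypothetical schemas: the real column names are not defined in this spec.
REQUIRED_COLUMNS = {
    "jira-sr": ["title", "description", "priority"],
    "tawos": ["title", "description", "story_points"],
}
LABEL_COLUMN = {"jira-sr": "priority", "tawos": "story_points"}


def build_subset(df, dataset_type, n_per_class=50, seed=42):
    """Return (evaluation_subset, metadata) for an already-loaded DataFrame."""
    # Constraint: validate that all required columns are present.
    missing = [c for c in REQUIRED_COLUMNS[dataset_type] if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    label = LABEL_COLUMN[dataset_type]
    counts = df[label].value_counts()
    too_small = counts[counts < n_per_class]
    if not too_small.empty:
        # Assumption: an undersized class is an error rather than a silent imbalance.
        raise ValueError(f"Classes with fewer than {n_per_class} rows: {list(too_small.index)}")

    # Stratified sample: n_per_class rows from every class, fixed by the seed.
    subset = df.groupby(label).sample(n=n_per_class, random_state=seed).sort_index()

    # Constraint: document the class distribution in the output metadata.
    metadata = {
        "dataset_type": dataset_type,
        "seed": seed,
        "n_per_class": n_per_class,
        "n_rows": len(subset),
        "class_distribution": subset[label].value_counts().to_dict(),
    }
    return subset, metadata
```

Fixing `random_state` makes repeated runs on the same input return the same rows in a given environment. Whether that holds across Python and pandas versions is something the reproducibility scenario would still need to verify.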

Component 2: Triage Agent

component: TriageAgent
purpose: Classify issue priority using configured LLM and prompting strategy
inputs:
  - name: issue
    type: dict
    fields: ["title (str)", "description (str, optional)"]
  - name: model
    type: str
    values: [llama3.2:8b, mistral:7b, phi3.5:3.8b]
  - name: strategy
    type: str
    values: [zero-shot, few-shot-3, few-shot-6, cot]
outputs:
  - name: prediction
    type: dict
    fields: 
      - predicted_priority: str (Critical|High|Medium|Low|Unknown)
      - raw_response: str
      - latency_s: float
      - emissions_gco2: float
      - model: str
      - strategy: str
acceptance_criteria:
  - "Reproducible: same input + seed=42 + temp=0 → same output"
  - "Latency < 60s per prediction on a CPU-only machine with 16GB RAM"
  - "Never crashes on unexpected model output — returns 'Unknown'"
  - "Logs all emissions via CodeCarbon"
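
The "never crashes, returns 'Unknown'" criterion can be sketched as a defensive parser for the raw model text. This is an illustration only: the function name `parse_priority` and the rule of accepting a response only when it names exactly one label are assumptions, not part of the specification.

```python
import re

VALID_PRIORITIES = ("Critical", "High", "Medium", "Low")
_PRIORITY_RE = re.compile(r"\b(critical|high|medium|low)\b", re.IGNORECASE)


def parse_priority(raw_response):
    """Map raw model text to one allowed label, or 'Unknown' if unclear."""
    if not isinstance(raw_response, str):
        return "Unknown"
    # Collect the distinct labels mentioned anywhere in the response.
    mentioned = {m.lower() for m in _PRIORITY_RE.findall(raw_response)}
    # Assumption: only an unambiguous answer (one distinct label) is accepted.
    if len(mentioned) != 1:
        return "Unknown"
    return mentioned.pop().capitalize()
```

A chain-of-thought response may mention several labels while reasoning, so this rule would map it to 'Unknown'. A structured final answer, for instance one validated against the JSON Schema named in the system overview, would probably be easier to parse for the `cot` strategy.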

Component 3: Estimation Agent

component: EstimationAgent
purpose: Estimate story points for Agile user stories
inputs:
  - name: story
    type: dict
    fields: [title, description, acceptance_criteria (optional)]
  - name: model
    type: str
  - name: strategy
    type: str
    values: [zero-shot, few-shot-3, cot]
outputs:
  - name: prediction
    type: dict
    fields:
      - predicted_sp: int (1|2|3|5|8|13|21)
      - raw_response: str
      - latency_s: float
      - emissions_gco2: float
acceptance_criteria:
  - "Output must be a valid Fibonacci number from: 1,2,3,5,8,13,21"
  - "Invalid outputs → log + return median of training distribution (5)"
  - "Latency < 60s"
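
The validation and fallback rule above could look like the following sketch. The logger name, the function name `parse_story_points`, and the policy of accepting a response only when it contains exactly one integer are assumptions made for illustration. The fallback value 5 is the one given in the acceptance criteria.

```python
import logging
import re

logger = logging.getLogger("puma.estimation")  # hypothetical logger name

VALID_STORY_POINTS = (1, 2, 3, 5, 8, 13, 21)
FALLBACK_SP = 5  # median of the training distribution, per the spec


def parse_story_points(raw_response):
    """Return a valid story-point value, or the fallback after logging."""
    numbers = re.findall(r"\d+", raw_response) if isinstance(raw_response, str) else []
    # Assumption: a usable answer contains exactly one integer.
    if len(numbers) == 1 and int(numbers[0]) in VALID_STORY_POINTS:
        return int(numbers[0])
    logger.warning("Invalid story-point output %r, using fallback %d", raw_response, FALLBACK_SP)
    return FALLBACK_SP
```

Under this policy a decimal such as "3.5" yields two integer tokens and is treated as invalid rather than rounded.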

Component 4: Evaluation Engine

component: EvaluationEngine
purpose: Calculate all metrics and run statistical tests
inputs:
  - name: predictions
    type: list[dict]
  - name: ground_truth
    type: list[str|int]
  - name: task
    type: str
    values: [triage, estimation]
outputs:
  - name: metrics
    type: dict
    fields:
      triage: [f1_macro, f1_per_class, precision, recall, wilcoxon_stat, p_value, effect_r]
      estimation: [mae, rmse, wilcoxon_stat, p_value, effect_r]
  - name: report
    type: str (markdown)
acceptance_criteria:
  - "F1 calculated with sklearn.metrics.f1_score(average='macro')"
  - "Wilcoxon: scipy.stats.wilcoxon (two-sided, alpha=0.05)"
  - "Effect size: r = Z / sqrt(N), where Z is the standardized z-statistic of the Wilcoxon test"
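
The Wilcoxon and effect-size criteria can be sketched as below. `scipy.stats.wilcoxon` reports the signed-rank statistic and the p-value. Here |Z| is recovered from the two-sided p-value through the normal quantile function, which is one common approximation and not necessarily the intended derivation. Taking N as the number of pairs is also an assumption, since the specification does not define N. The function name and the choice of paired inputs (for example, per-item scores of a model and of the heuristic baseline) are hypothetical.

```python
import math

from scipy.stats import norm, wilcoxon


def wilcoxon_effect_size(scores_a, scores_b, alpha=0.05):
    """Two-sided Wilcoxon signed-rank test on paired scores, plus r = Z / sqrt(N)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("A paired test needs two score lists of equal length")
    result = wilcoxon(scores_a, scores_b, alternative="two-sided")
    p_value = float(result.pvalue)
    # Assumption: |Z| is derived from the two-sided p-value; the sign is discarded.
    z = float(norm.isf(p_value / 2.0))
    n = len(scores_a)  # assumption: N is the number of pairs
    return {
        "wilcoxon_stat": float(result.statistic),
        "p_value": p_value,
        "effect_r": z / math.sqrt(n),
        "significant": p_value < alpha,
    }
```

For small samples SciPy may compute an exact p-value, in which case the recovered Z is only an approximation of the normal-approximation statistic.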

Technology Stack

Layer     | Tool                | Version | Justification
----------|---------------------|---------|------------------------------------
Inference | Ollama              | ≥0.5.0  | Local, deterministic, no GPU needed
Models    | Llama 3.2 8B        | latest  | Strong baseline, Meta open-weights
Models    | Mistral 7B          | latest  | Alternative architecture
Data      | pandas              | 2.x     | Standard, well-documented
Metrics   | scikit-learn        | 1.4.x   | F1, precision, recall
Stats     | scipy               | 1.12.x  | Wilcoxon test
Carbon    | CodeCarbon          | 2.x     | gCO₂eq per condition
Viz       | matplotlib, seaborn | latest  | Reproducible plots
Notebooks | Jupyter             | ≥7.0    | Interactive analysis
Version   | Git + GitHub        |         | Public MIT repository

BDD Acceptance Scenarios

Feature: PUMA Reproducibility Guarantee
  
  Scenario: Identical results across fresh environments
    Given a clean Python 3.11 environment
    And requirements installed from requirements.txt
    And Ollama running with seed=42 configured
    When the benchmark runner executes on the evaluation subset
    Then results must match stored baseline within floating-point tolerance
    And the PRISMA/run report must be regenerated identically
 
Feature: MVP Success Condition
  
  Scenario: MVP threshold met (H1)
    Given the triage experiment runs on Jira SR subset
    When results are evaluated
    Then at least one configuration must achieve F1-macro >= 0.55
    And the Wilcoxon test must return p < 0.05 vs heuristic baseline
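
The two scenarios above can be expressed as small checks, shown here as a hedged sketch. The function names, the tolerance values, and the layout of the result dictionaries are hypothetical. The sketch also assumes that the F1 threshold and the significance condition must hold for the same configuration, which the scenario text does not state explicitly.

```python
import math

F1_THRESHOLD = 0.55  # from the MVP scenario
ALPHA = 0.05         # from the MVP scenario


def matches_baseline(results, baseline, rel_tol=1e-9, abs_tol=1e-12):
    """True if both dicts hold the same metric names with numerically close values."""
    if results.keys() != baseline.keys():
        return False
    return all(
        math.isclose(results[k], baseline[k], rel_tol=rel_tol, abs_tol=abs_tol)
        for k in baseline
    )


def mvp_threshold_met(configs):
    """True if at least one configuration meets both H1 conditions."""
    return any(
        c["f1_macro"] >= F1_THRESHOLD and c["p_value"] < ALPHA
        for c in configs
    )
```

Comparing with `math.isclose` rather than `==` tolerates floating-point rounding differences between environments while still flagging any real change in a metric.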

Frameworks: PN-SDD-Framework | PN-DSR-SLR-Methods

Project governance: SP-PUMA-Constitution — Non-negotiable principles | SP-Triage-Agent — Triage agent spec | SP-Estimation-Dataset-Specs

Experiments & agents: EX-Hypotheses-H1-H2 | BMAD-Agent-Roster

Core concepts: PN-KeyConcepts-Agents-Reproducibility-RedTeam (Agent OS, Reproducibility) | PN-LLM-Local-vs-Cloud (Local inference rationale)

Prompts: PT-DevTools-Prompts
