PUMA Architecture Specification v1.0

Spec-First Document

Constitutional Preamble

Overview

This specification is written BEFORE implementation begins. All changes require version increment.

System Overview

┌─────────────────────────────────────────────────────────────────┐
│                        PUMA BENCHMARK                           │
├──────────┬─────────────────────────┬──────────┬────────────────┤
│  INPUT   │     AGENT SWARM         │GOVERNANCE│   OUTPUTS      │
│  DATA    │   (LOCAL LLMs)          │& CONTROL │ & ARTEFACTS    │
├──────────┼─────────────────────────┼──────────┼────────────────┤
│          │                         │          │                │
│ Jira SR  │ ┌──────────────────┐   │ Ollama   │ Performance    │
│ (CSV)    │ │  Triage Agent    │   │ Orchestr.│ Report         │
│          │ │  (Stage 1)       │   │          │                │
│ TAWOS    │ └─────────┬────────┘   │ JSON     │ Reproducible   │
│ (CSV)    │           │            │ Schema & │ Metrics        │
│          │ ┌─────────▼────────┐   │ Ethics   │ (F1, MAE)      │
│          │ │ Estimation Agent │   │ Guardr.  │                │
│          │ │  (Stage 2)       │   │          │ Open Source    │
│          │ └──────────────────┘   │ Human-   │ Code           │
│          │                        │ in-Loop  │ (MIT Licence)  │
│          │ Models:                │ Validat. │                │
│          │ • Llama 3.2 8B         │          │ Academic Paper │
│          │ • Mistral 7B           │ CodeCarb.│ (PUMA Repo)     │
│          │ • Phi-3.5 Mini*        │ Tracking │                │
└──────────┴─────────────────────────┴──────────┴────────────────┘
* Fallback model if latency > 60s

Component Specifications

Component 1: Data Loader

component: DataLoader
purpose: Load, validate, and prepare evaluation subsets from raw datasets
inputs:
  - name: dataset_path
    type: str
    description: Path to raw CSV file
  - name: dataset_type
    type: str
    values: [jira-sr, tawos]
  - name: n_per_class
    type: int
    description: Number of samples per class for stratified subset
    default: 50
  - name: seed
    type: int
    default: 42
outputs:
  - name: evaluation_subset
    type: pd.DataFrame
    description: Balanced, validated evaluation dataset
  - name: metadata
    type: dict
    description: Dataset statistics and composition report
constraints:
  - "Must produce identical output for same seed across Python versions"
  - "Must validate all required columns are present"
  - "Must document class distribution in output metadata"

Component 2: Triage Agent

component: TriageAgent
purpose: Classify issue priority using configured LLM and prompting strategy
inputs:
  - name: issue
    type: dict
    fields: [title (str), description (str, optional)]
  - name: model
    type: str
    values: [llama3.2:8b, mistral:7b, phi3.5:3.8b]
  - name: strategy
    type: str
    values: [zero-shot, few-shot-3, few-shot-6, cot]
outputs:
  - name: prediction
    type: dict
    fields: 
      - predicted_priority: str (Critical|High|Medium|Low|Unknown)
      - raw_response: str
      - latency_s: float
      - emissions_gco2: float
      - model: str
      - strategy: str
acceptance_criteria:
  - "Reproducible: same input + seed=42 + temp=0 → same output"
  - "Latency < 60s on 16GB RAM CPU"
  - "Never crashes on unexpected model output — returns 'Unknown'"
  - "Logs all emissions via CodeCarbon"

Component 3: Estimation Agent

component: EstimationAgent  
purpose: Estimate story points for Agile user stories
inputs:
  - name: story
    type: dict
    fields: [title, description, acceptance_criteria (optional)]
  - name: model
    type: str
  - name: strategy
    type: str
    values: [zero-shot, few-shot-3, cot]
outputs:
  - name: prediction
    type: dict
    fields:
      - predicted_sp: int (1|2|3|5|8|13|21)
      - raw_response: str
      - latency_s: float
      - emissions_gco2: float
acceptance_criteria:
  - "Output must be a valid Fibonacci number from: 1,2,3,5,8,13,21"
  - "Invalid outputs → log + return median of training distribution (5)"
  - "Latency < 60s"

Component 4: Evaluation Engine

component: EvaluationEngine
purpose: Calculate all metrics and run statistical tests
inputs:
  - name: predictions
    type: list[dict]
  - name: ground_truth
    type: list[str|int]
  - name: task
    type: str
    values: [triage, estimation]
outputs:
  - name: metrics
    type: dict
    fields:
      triage: [f1_macro, f1_per_class, precision, recall, wilcoxon_stat, p_value, effect_r]
      estimation: [mae, rmse, wilcoxon_stat, p_value, effect_r]
  - name: report
    type: str (markdown)
acceptance_criteria:
  - "F1 calculated with sklearn.metrics.f1_score(average='macro')"
  - "Wilcoxon: scipy.stats.wilcoxon (two-sided, alpha=0.05)"
  - "Effect size: r = Z / sqrt(N) where Z from Wilcoxon"

Technology Stack

Layer	Tool	Version	Justification
Inference	Ollama	≥0.5.0	Local, deterministic, no GPU needed
Models	Llama 3.2 8B	latest	Strong baseline, Meta open-weights
Models	Mistral 7B	latest	Alternative architecture
Data	pandas	2.x	Standard, well-documented
Metrics	scikit-learn	1.4.x	F1, precision, recall
Stats	scipy	1.12.x	Wilcoxon test
Carbon	CodeCarbon	2.x	gCO₂eq per condition
Viz	matplotlib, seaborn	latest	Reproducible plots
Notebooks	Jupyter	≥7.0	Interactive analysis
Version	Git + GitHub	—	Public MIT repository

BDD Acceptance Scenarios

Feature: PUMA Reproducibility Guarantee
  
  Scenario: Identical results across fresh environments
    Given a clean Python 3.11 environment
    And requirements installed from requirements.txt
    And Ollama running with seed=42 configured
    When the benchmark runner executes on the evaluation subset
    Then results must match stored baseline within floating-point tolerance
    And the PRISMA/run report must be regenerated identically
 
Feature: MVP Success Condition
  
  Scenario: MVP threshold met (H1)
    Given the triage experiment runs on Jira SR subset
    When results are evaluated
    Then at least one configuration must achieve F1-macro >= 0.55
    And the Wilcoxon test must return p < 0.05 vs heuristic baseline

Frameworks: PN-SDD-Framework | PN-DSR-SLR-Methods

Project governance: SP-PUMA-Constitution — Non-negotiable principles | SP-Triage-Agent — Triage agent spec | SP-Estimation-Dataset-Specs

Experiments & agents: EX-Hypotheses-H1-H2 | BMAD-Agent-Roster

Core concepts: PN-KeyConcepts-Agents-Reproducibility-RedTeam (Agent OS, Reproducibility) | PN-LLM-Local-vs-Cloud (Local inference rationale)

Prompts: PT-DevTools-Prompts

Navigation: MOC-PUMA-Master | MOC-Tools-Stack

PUMA Vault

Explorador

PUMA Architecture Specification v1.0

PUMA Architecture Specification v1.0

System Overview

Component Specifications

Component 1: Data Loader

Component 2: Triage Agent

Component 3: Estimation Agent

Component 4: Evaluation Engine

Technology Stack

BDD Acceptance Scenarios

Vista Gráfica

Tabla de Contenidos

Retroenlaces

PUMA Vault

Explorador

PUMA Architecture Specification v1.0

PUMA Architecture Specification v1.0

System Overview

Component Specifications

Component 1: Data Loader

Component 2: Triage Agent

Component 3: Estimation Agent

Component 4: Evaluation Engine

Technology Stack

BDD Acceptance Scenarios

🔗 Related Notes

Vista Gráfica

Tabla de Contenidos

Retroenlaces