Experiment Overview — Stage 1: Issue Triage
Tests H1. Full design for all 8 experimental conditions + 2 baselines.
Overview
Individual runs: create from Template-Experiment-Note
Experimental Design Matrix
| Condition | Model | Strategy | Status | Note |
|---|---|---|---|---|
| B1 | Heuristic keywords | — | ⏳ | Baseline 1 |
| B2 | TF-IDF + SVM | — | ⏳ | Baseline 2 |
| C1 | llama3.2:8b | zero-shot | ⏳ | EX-Llama32-ZeroShot-Triage |
| C2 | llama3.2:8b | few-shot-3 | ⏳ | EX-Llama32-FewShot3-Triage |
| C3 | llama3.2:8b | few-shot-6 | ⏳ | EX-Llama32-FewShot6-Triage |
| C4 | llama3.2:8b | cot | ⏳ | EX-Llama32-CoT-Triage |
| C5 | mistral:7b | zero-shot | ⏳ | EX-Mistral7B-ZeroShot-Triage |
| C6 | mistral:7b | few-shot-3 | ⏳ | EX-Mistral7B-FewShot3-Triage |
| C7 | mistral:7b | few-shot-6 | ⏳ | EX-Mistral7B-FewShot6-Triage |
| C8 | mistral:7b | cot | ⏳ | EX-Mistral7B-CoT-Triage |
Total conditions: 10 (8 LLM + 2 baseline) Total inferences: 10 × 200 issues = 2,000 calls Estimated time: ~15–40 s/call → 8–22 hours total on CPU
Fixed Parameters (All Conditions)
| Parameter | Value |
|---|---|
| Dataset | Jira SR — stratified subset |
| Subset size | 200 issues (50 per priority class) |
| seed | 42 |
| temperature | 0 |
| Ollama version | ≥ 0.5.0 |
| Hardware | 16 GB RAM, CPU-only |
| Carbon tracking | CodeCarbon per call |
Primary Metric: F1-macro
- Formula:
sklearn.metrics.f1_score(y_true, y_pred, average='macro') - MVP threshold: ≥ 0.55
- Desired threshold: ≥ 0.70
- Statistical test: Wilcoxon signed-rank vs B1 (heuristic baseline), α = 0.05
Prompt Templates Used
| Strategy | Template |
|---|---|
| zero-shot | PT-PUMA-Experiment-Prompts → TRIAGE_ZERO_SHOT_TEMPLATE |
| few-shot-3 | PT-PUMA-Experiment-Prompts → TRIAGE_FEW_SHOT_3_TEMPLATE |
| few-shot-6 | PT-PUMA-Experiment-Prompts → TRIAGE_FEW_SHOT_6_TEMPLATE |
| cot | PT-PUMA-Experiment-Prompts → TRIAGE_COT_TEMPLATE |
Results Aggregation (to fill during F4)
TABLE model, strategy, f1_macro, wilcoxon_p, effect_r, emissions_gco2
FROM "40 - Projects/PUMA/41.7 Experiments/Stage1-Triage"
WHERE type = "experiment" AND stage = "1-triage"
SORT f1_macro DESCRun Order & Risk Mitigation
Recommended run order:
- B1 (heuristic) — fastest, establishes baseline immediately
- C1 (llama3.2, zero-shot) — simplest LLM condition, diagnoses setup
- C5 (mistral, zero-shot) — cross-model comparison at simplest condition
- C2–C4 (llama3.2 remaining strategies)
- C6–C8 (mistral remaining strategies)
- B2 (SVM baseline) — most complex baseline, run last
If latency > 60s/call: Switch to Phi-3.5 Mini (3.8B) as fallback model. If F1 < 0.40 for all LLM conditions: Extend error analysis; consider this a negative result (still publishable).
🔗 Related Notes
Hypotheses & governance: EX-Hypotheses-H1-H2 · SP-PUMA-Constitution (Art. 1, 3)
Specs: SP-Triage-Agent · SP-Architecture · SP-Estimation-Dataset-Specs
Datasets: LN-Datasets-JiraSR-TAWOS
Prompts: PT-PUMA-Experiment-Prompts
PM concepts: PN-IssueTriage-StoryPoints · PN-CoT-FewShot-Prompting
Navigation: PR-PUMA-Ch4-Results · MOC-PUMA-Master
id: EX-Stage2-Effort-Overview title: “Experiment Overview — Stage 2: Effort Estimation” type: experiment tags: [experiment, estimation, stage2, overview, h2, story-points] stage: “2-estimation” status: planned created: 2026-04-10
Experiment Overview — Stage 2: Effort Estimation
Tests H2. Six LLM conditions + three baselines. Depends on EX-Stage1-Triage-Overview being complete (MVP validated).
Experimental Design Matrix
| Condition | Model | Strategy | Status |
|---|---|---|---|
| B1 | Historical mean | — | ⏳ |
| B2 | Deep-SE (reported value) | — | ⏳ (literature) |
| B3 | CoGEE/GPT-4 (reported) | — | ⏳ (literature) |
| C1 | llama3.2:8b | zero-shot | ⏳ |
| C2 | llama3.2:8b | few-shot-3 | ⏳ |
| C3 | llama3.2:8b | cot | ⏳ |
| C4 | mistral:7b | zero-shot | ⏳ |
| C5 | mistral:7b | few-shot-3 | ⏳ |
| C6 | mistral:7b | cot | ⏳ |
Total PUMA conditions: 6 LLM (+ 3 from literature) Total inferences: 6 × 350 stories = 2,100 calls
Fixed Parameters
| Parameter | Value |
|---|---|
| Dataset | TAWOS — stratified subset |
| Subset size | 350 stories (50 per SP class × 7 classes) |
| seed | 42 · temperature |
| Fibonacci scale | {1, 2, 3, 5, 8, 13, 21} |
| Carbon tracking | CodeCarbon per call |
Primary Metric: MAE
- Formula:
mean(abs(predicted_sp - actual_sp)) - MVP threshold: ≤ 3.0 SP
- Desired threshold: ≤ 1.5 SP
- Statistical test: Wilcoxon vs B1 (historical mean), α = 0.05
🔗 Related Notes
Stage 1 context: EX-Hypotheses-H1-H2 (H2) · SP-Estimation-Dataset-Specs
Dataset & baseline: LN-Datasets-JiraSR-TAWOS (TAWOS) · LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg (CoGEE baseline)
PM concepts: PN-IssueTriage-StoryPoints (MAE, Fibonacci scale) · PN-CoT-FewShot-Prompting (strategies S1–S3)
Navigation: PR-PUMA-Ch4-Results · MOC-PUMA-Master