Experiment Overview — Stage 1: Issue Triage

Tests H1. Full design for all 8 experimental conditions + 2 baselines.

Overview

Individual runs: create from Template-Experiment-Note


Experimental Design Matrix

| Condition | Model | Strategy | Status | Note |
|---|---|---|---|---|
| B1 | Heuristic keywords | | | Baseline 1 |
| B2 | TF-IDF + SVM | | | Baseline 2 |
| C1 | llama3.2:8b | zero-shot | | EX-Llama32-ZeroShot-Triage |
| C2 | llama3.2:8b | few-shot-3 | | EX-Llama32-FewShot3-Triage |
| C3 | llama3.2:8b | few-shot-6 | | EX-Llama32-FewShot6-Triage |
| C4 | llama3.2:8b | cot | | EX-Llama32-CoT-Triage |
| C5 | mistral:7b | zero-shot | | EX-Mistral7B-ZeroShot-Triage |
| C6 | mistral:7b | few-shot-3 | | EX-Mistral7B-FewShot3-Triage |
| C7 | mistral:7b | few-shot-6 | | EX-Mistral7B-FewShot6-Triage |
| C8 | mistral:7b | cot | | EX-Mistral7B-CoT-Triage |
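
The eight LLM rows of the matrix are the cross product of two models and four strategies, numbered model by model. A minimal sketch of generating them, assuming the note-name pattern `EX-<Model>-<Strategy>-Triage` seen in the Note column; the function name and the two label dictionaries are illustrative and not taken from the PUMA codebase:

```python
from itertools import product

# Model and strategy tags exactly as listed in the design matrix.
MODELS = ["llama3.2:8b", "mistral:7b"]
STRATEGIES = ["zero-shot", "few-shot-3", "few-shot-6", "cot"]

# Hypothetical lookup tables, inferred from the note names in the matrix.
MODEL_LABEL = {"llama3.2:8b": "Llama32", "mistral:7b": "Mistral7B"}
STRATEGY_LABEL = {
    "zero-shot": "ZeroShot",
    "few-shot-3": "FewShot3",
    "few-shot-6": "FewShot6",
    "cot": "CoT",
}


def build_llm_conditions():
    """Return one dict per LLM condition, in matrix order C1..C8."""
    conditions = []
    # product() varies the strategy fastest, so all llama rows come first.
    for index, (model, strategy) in enumerate(product(MODELS, STRATEGIES), start=1):
        conditions.append({
            "condition": f"C{index}",
            "model": model,
            "strategy": strategy,
            "note": f"EX-{MODEL_LABEL[model]}-{STRATEGY_LABEL[strategy]}-Triage",
        })
    return conditions


if __name__ == "__main__":
    for row in build_llm_conditions():
        print(row["condition"], row["model"], row["strategy"], row["note"])
```

Generating the rows from two short lists keeps condition IDs and note names consistent if a model or strategy is swapped later.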

  • Total conditions: 10 (8 LLM + 2 baseline)
  • Total inferences: 10 × 200 issues = 2,000 calls
  • Estimated time: ~15–40 s/call → 8–22 hours total on CPU
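
The runtime estimate follows from simple arithmetic. A small sketch that reproduces it, assuming (as the figures above do) that every one of the 2,000 calls, baselines included, costs between 15 and 40 seconds; the function and constant names are illustrative:

```python
CONDITIONS = 10               # 8 LLM + 2 baselines
ISSUES_PER_CONDITION = 200
SECONDS_PER_CALL = (15, 40)   # lower and upper latency estimate


def estimate_hours(conditions, issues, seconds_per_call):
    """Return (total_calls, hours_low, hours_high) for one full run."""
    total_calls = conditions * issues
    low, high = (total_calls * s / 3600 for s in seconds_per_call)
    return total_calls, low, high


if __name__ == "__main__":
    calls, low, high = estimate_hours(CONDITIONS, ISSUES_PER_CONDITION, SECONDS_PER_CALL)
    print(f"{calls} calls: {low:.1f} to {high:.1f} hours")
    # prints "2000 calls: 8.3 to 22.2 hours"
```

The baselines are likely much faster than an LLM call, so this is an upper bound; calling `estimate_hours(8, 200, SECONDS_PER_CALL)` gives the figure for the LLM conditions alone.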


Fixed Parameters (All Conditions)

| Parameter | Value |
|---|---|
| Dataset | Jira SR (stratified subset) |
| Subset size | 200 issues (50 per priority class) |
| seed | 42 |
| temperature | 0 |
| Ollama version | ≥ 0.5.0 |
| Hardware | 16 GB RAM, CPU-only |
| Carbon tracking | CodeCarbon per call |
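
A minimal sketch of how the fixed `seed` and `temperature` could be passed on every call through Ollama's `/api/generate` endpoint. The local URL, the timeout, and the function names are assumptions made for illustration; CodeCarbon tracking is left out here:

```python
import json
import urllib.request

# Assumed: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Fixed parameters from the table above.
FIXED_OPTIONS = {"seed": 42, "temperature": 0}


def build_payload(model, prompt):
    """Build a non-streaming generate request with the fixed sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": dict(FIXED_OPTIONS),
    }


def generate(model, prompt, timeout_s=120):
    """Send one request to a running Ollama server and return the generated text.

    timeout_s is a hypothetical safety margin, not a value from the design.
    """
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

Keeping the options in one constant means every condition shares the same sampling settings, so differences between runs can be attributed to model and strategy.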

Primary Metric: F1-macro

  • Formula: sklearn.metrics.f1_score(y_true, y_pred, average='macro')
  • MVP threshold: ≥ 0.55
  • Desired threshold: ≥ 0.70
  • Statistical test: Wilcoxon signed-rank vs B1 (heuristic baseline), α = 0.05
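
To make the metric concrete, here is a dependency-free sketch of macro-averaged F1 and the two thresholds. It is meant to mirror what `f1_score(..., average='macro')` computes (the unweighted mean of per-class F1, with 0 for a class that has no true or predicted instances), not to replace it. The priority labels and function names are hypothetical, and the Wilcoxon test is not covered:

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def threshold_verdict(score, mvp=0.55, desired=0.70):
    """Map a score onto the thresholds stated above."""
    if score >= desired:
        return "desired"
    if score >= mvp:
        return "mvp"
    return "below-mvp"


if __name__ == "__main__":
    # Toy labels for illustration only; not real experiment data.
    y_true = ["High", "High", "Medium", "Medium", "Low", "Low"]
    y_pred = ["High", "Medium", "Medium", "Medium", "Low", "High"]
    score = f1_macro(y_true, y_pred)
    print(round(score, 4), threshold_verdict(score))
```

In the toy example the per-class scores are 0.5 (High), 0.8 (Medium), and 2/3 (Low), so the macro average is about 0.656: above the MVP threshold but below the desired one.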

Prompt Templates Used

| Strategy | Template |
|---|---|
| zero-shot | PT-PUMA-Experiment-Prompts → TRIAGE_ZERO_SHOT_TEMPLATE |
| few-shot-3 | PT-PUMA-Experiment-Prompts → TRIAGE_FEW_SHOT_3_TEMPLATE |
| few-shot-6 | PT-PUMA-Experiment-Prompts → TRIAGE_FEW_SHOT_6_TEMPLATE |
| cot | PT-PUMA-Experiment-Prompts → TRIAGE_COT_TEMPLATE |

Results Aggregation (to fill during F4)

TABLE model, strategy, f1_macro, wilcoxon_p, effect_r, emissions_gco2
FROM "40 - Projects/PUMA/41.7 Experiments/Stage1-Triage"
WHERE type = "experiment" AND stage = "1-triage"
SORT f1_macro DESC

Run Order & Risk Mitigation

Recommended run order:

  1. B1 (heuristic) — fastest, establishes baseline immediately
  2. C1 (llama3.2, zero-shot) — simplest LLM condition, diagnoses setup
  3. C5 (mistral, zero-shot) — cross-model comparison at simplest condition
  4. C2–C4 (llama3.2 remaining strategies)
  5. C6–C8 (mistral remaining strategies)
  6. B2 (SVM baseline) — most complex baseline, run last

Risk mitigation:

  • If latency > 60 s/call: switch to Phi-3.5 Mini (3.8B) as fallback model.
  • If F1 < 0.40 for all LLM conditions: extend error analysis; consider this a negative result (still publishable).
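
The run order and the two mitigation rules can be written down as plain data and two checks. This is a sketch under assumptions: the latency is taken to be an observed per-call figure from a pilot run, and all names are illustrative:

```python
# Recommended order: B1, C1, C5, then C2..C4, C6..C8, and B2 last.
RUN_ORDER = ["B1", "C1", "C5", "C2", "C3", "C4", "C6", "C7", "C8", "B2"]

LATENCY_LIMIT_S = 60.0
NEGATIVE_RESULT_F1 = 0.40
FALLBACK_MODEL = "Phi-3.5 Mini (3.8B)"


def choose_model(current_model, observed_latency_s):
    """Return the fallback model when the observed latency exceeds the limit."""
    if observed_latency_s > LATENCY_LIMIT_S:
        return FALLBACK_MODEL
    return current_model


def is_negative_result(f1_by_condition):
    """True when every LLM condition (IDs starting with 'C') scores below 0.40."""
    llm_scores = [f1 for cond, f1 in f1_by_condition.items() if cond.startswith("C")]
    return bool(llm_scores) and all(f1 < NEGATIVE_RESULT_F1 for f1 in llm_scores)
```

Baseline scores are ignored by `is_negative_result`, since the rule is stated for LLM conditions only.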


Hypotheses & governance: EX-Hypotheses-H1-H2 · SP-PUMA-Constitution (Art. 1, 3)

Specs: SP-Triage-Agent · SP-Architecture · SP-Estimation-Dataset-Specs

Datasets: LN-Datasets-JiraSR-TAWOS

Prompts: PT-PUMA-Experiment-Prompts

PM concepts: PN-IssueTriage-StoryPoints · PN-CoT-FewShot-Prompting

Navigation: PR-PUMA-Ch4-Results · MOC-PUMA-Master



id: EX-Stage2-Effort-Overview
title: "Experiment Overview — Stage 2: Effort Estimation"
type: experiment
tags: [experiment, estimation, stage2, overview, h2, story-points]
stage: "2-estimation"
status: planned
created: 2026-04-10

Experiment Overview — Stage 2: Effort Estimation

Tests H2. Six LLM conditions + three baselines. Depends on EX-Stage1-Triage-Overview being complete (MVP validated).


Experimental Design Matrix

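| Condition | Model | Strategy | Status |
|---|---|---|---|
| B1 | Historical mean | | |
| B2 | Deep-SE (reported value) | | ⏳ (literature) |
| B3 | CoGEE/GPT-4 (reported) | | ⏳ (literature) |
| C1 | llama3.2:8b | zero-shot | |
| C2 | llama3.2:8b | few-shot-3 | |
| C3 | llama3.2:8b | cot | |
| C4 | mistral:7b | zero-shot | |
| C5 | mistral:7b | few-shot-3 | |
| C6 | mistral:7b | cot | |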

  • Total PUMA conditions: 6 LLM (+ 3 baselines, of which B2 and B3 are values reported in the literature)
  • Total inferences: 6 × 350 stories = 2,100 calls


Fixed Parameters

| Parameter | Value |
|---|---|
| Dataset | TAWOS (stratified subset) |
| Subset size | 350 stories (50 per SP class × 7 classes) |
| seed | 42 · temperature |
| Fibonacci scale | {1, 2, 3, 5, 8, 13, 21} |
| Carbon tracking | CodeCarbon per call |
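
A minimal sketch of drawing the stratified subset reproducibly: 50 stories for each of the 7 story-point classes, using seed 42. The `story_points` field name and the function name are hypothetical; the real TAWOS schema may differ:

```python
import random
from collections import defaultdict

FIBONACCI_SCALE = (1, 2, 3, 5, 8, 13, 21)
PER_CLASS = 50
SEED = 42


def stratified_subset(stories, per_class=PER_CLASS, seed=SEED):
    """Draw `per_class` stories for each story-point class, reproducibly."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    by_class = defaultdict(list)
    for story in stories:
        by_class[story["story_points"]].append(story)

    subset = []
    for sp in FIBONACCI_SCALE:
        pool = by_class[sp]
        if len(pool) < per_class:
            raise ValueError(f"class {sp} has only {len(pool)} stories")
        subset.extend(rng.sample(pool, per_class))
    return subset
```

Because the generator is seeded and the classes are visited in a fixed order, the same input list always yields the same 350 stories.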

Primary Metric: MAE

  • Formula: mean(abs(predicted_sp - actual_sp))
  • MVP threshold: ≤ 3.0 SP
  • Desired threshold: ≤ 1.5 SP
  • Statistical test: Wilcoxon vs B1 (historical mean), α = 0.05
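
The MAE formula and the B1 baseline can be sketched in a few lines. The numbers below are toy values for illustration only, and the helper names are assumptions; the Wilcoxon test is not covered:

```python
def mae(predicted_sp, actual_sp):
    """Mean absolute error in story points: mean(abs(predicted_sp - actual_sp))."""
    if not actual_sp or len(predicted_sp) != len(actual_sp):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted_sp, actual_sp)) / len(actual_sp)


def historical_mean_predictions(train_sp, n):
    """B1-style baseline: predict the mean of past story points for every story."""
    mean_sp = sum(train_sp) / len(train_sp)
    return [mean_sp] * n


if __name__ == "__main__":
    actual = [1, 2, 3, 5, 8, 13, 21]       # toy ground truth, one per class
    predicted = [2, 2, 5, 5, 8, 8, 13]     # toy model output
    baseline = historical_mean_predictions([3, 5, 8], len(actual))
    print(f"model MAE: {mae(predicted, actual):.2f} SP")
    print(f"baseline MAE: {mae(baseline, actual):.2f} SP")
```

In the toy example the absolute errors sum to 16 over 7 stories, an MAE of about 2.29 SP: within the MVP threshold of 3.0 SP but above the desired 1.5 SP.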

Stage 1 context: EX-Hypotheses-H1-H2 (H2) · SP-Estimation-Dataset-Specs

Dataset & baseline: LN-Datasets-JiraSR-TAWOS (TAWOS) · LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg (CoGEE baseline)

PM concepts: PN-IssueTriage-StoryPoints (MAE, Fibonacci scale) · PN-CoT-FewShot-Prompting (strategies S1–S3)

Navigation: PR-PUMA-Ch4-Results · MOC-PUMA-Master