PUMA Research Hypotheses — H1 & H2

Main Research Question

Overview

Are there statistically significant differences in the quality of automatic issue triage and effort estimation across different LLMs and prompting strategies, when evaluated on real ICT project datasets with verified reference labels?


H1 — Issue Triage Hypothesis

Formulation (Popper falsifiability framework)

H₀₁ (Null, refutable): No LLM + strategy configuration achieves an F1-macro higher than the keyword-based heuristic baseline on Jira SR. Any observed difference is attributable to random variation (two-sided Wilcoxon signed-rank test, α = 0.05).

H₁ (Alternative): At least one configuration achieves an F1-macro significantly higher (p < 0.05) than the heuristic baseline, with effect size r ≥ 0.1 computed over per-class score distributions.

Falsification condition: If no configuration beats the baseline with statistical significance under the defined protocol, H₀₁ is not rejected. The conclusion is then that the local LLMs evaluated provide no measurable improvement over keyword heuristics under these conditions.

Operational Variables

| Variable type | Variable | Value/Range |
| --- | --- | --- |
| Independent | LLM model | {llama3.2:8b, mistral:7b} |
| Independent | Prompting strategy | {zero-shot, few-shot-3, few-shot-6, cot} |
| Dependent | F1-macro | 0.0–1.0 |
| Dependent | Precision per class | 0.0–1.0 |
| Dependent | Recall per class | 0.0–1.0 |
| Control | seed | 42 |
| Control | temperature | 0 |
| Control | Dataset subset | 200 issues, stratified (50/class) |
| Control | Hardware | 16 GB RAM, CPU |
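The dataset control (200 issues, 50 per class, seed 42) can be sketched as a reproducible stratified draw. This is a minimal illustration, not the project's actual sampling code: the function name `stratified_sample`, the dict-based issue records, and the `priority` field name are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(issues, label_key="priority", per_class=50, seed=42):
    """Draw `per_class` issues from every class, reproducibly.

    `issues` is assumed to be a list of dicts; `label_key` is a
    hypothetical field name holding the reference priority label.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    by_label = defaultdict(list)
    for issue in issues:
        by_label[issue[label_key]].append(issue)

    sample = []
    for label in sorted(by_label):  # fixed class order keeps the draw deterministic
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label!r} has only {len(pool)} issues")
        sample.extend(rng.sample(pool, per_class))
    return sample
```

With four priority classes this yields the 200-issue subset. Raising on an under-populated class is a design choice made here so that a short class cannot silently unbalance the subset.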

Baseline Definition

Heuristic baseline (keyword classifier):

PRIORITY_KEYWORDS = {
    'Critical': ['crash', 'down', 'production', 'all users', 'data loss', 
                 'security', 'outage', 'failing', '500 error'],
    'High': ['slow', 'performance', 'timeout', 'broken', 'cannot', 'fails'],
    'Low': ['typo', 'colour', 'color', 'alignment', 'documentation', 'tooltip']
}
# Default: Medium
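The keyword table alone does not say how a match is resolved. One possible reading, sketched below, is case-insensitive substring matching where the first matching class wins in the order Critical, High, Low, and Medium is the fallback. The function name `classify_priority` and the matching rules are assumptions. The dictionary is repeated only to keep the sketch self-contained.

```python
PRIORITY_KEYWORDS = {
    'Critical': ['crash', 'down', 'production', 'all users', 'data loss',
                 'security', 'outage', 'failing', '500 error'],
    'High': ['slow', 'performance', 'timeout', 'broken', 'cannot', 'fails'],
    'Low': ['typo', 'colour', 'color', 'alignment', 'documentation', 'tooltip'],
}
DEFAULT_PRIORITY = 'Medium'


def classify_priority(text: str) -> str:
    """Return the first priority whose keyword occurs in `text`.

    Assumed rules (not stated in the source): case-insensitive substring
    match, classes checked in dict order (Critical, High, Low).
    """
    lowered = text.lower()
    for priority, keywords in PRIORITY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return priority
    return DEFAULT_PRIORITY
```

Under these rules, plain substring matching produces false positives: "Dropdown alignment is off" is classified as Critical because "dropdown" contains "down". Word-boundary matching would avoid this, at the cost of a different baseline score.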

Expected baseline F1-macro: 0.45–0.52 (an estimate, to be computed on the subset before the experiments)
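For reference, F1-macro is the unweighted mean of the per-class F1 scores, so each of the four priority classes counts equally. The sketch below computes it from scratch. The function name `f1_macro` and the convention of scoring an undefined precision or recall as 0 are assumptions of this example.

```python
def f1_macro(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels`."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumed convention: undefined ratios count as 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)
```

Passing `labels` explicitly keeps a class in the average even when the classifier never predicts it, which matters for a baseline that defaults to Medium.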

MVP Threshold

| Threshold | Value | Status |
| --- | --- | --- |
| Minimum (MVP) | F1-macro ≥ 0.55 | Required for MVP success |
| Desired | F1-macro ≥ 0.70 | Target for quality contribution |

H2 — Effort Estimation Hypothesis

Formulation

H₀₂ (Null, refutable): No few-shot configuration achieves an MAE below that of the per-project historical-mean estimator on TAWOS. Any observed difference is attributable to random variation (two-sided Wilcoxon signed-rank test, α = 0.05).

H₂ (Alternative): At least one few-shot configuration achieves an MAE significantly below that of the historical-mean estimator (p < 0.05), with an absolute reduction ≥ 0.5 story points as the practical relevance threshold.
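The practical relevance part of H₂ can be illustrated as follows: the historical-mean estimator predicts the mean story points of past issues for every new issue, and a configuration is practically relevant when its MAE is at least 0.5 SP lower. All function names are hypothetical, and the statistical significance check is a separate step that this sketch does not cover.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error in story points."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def historical_mean_predictions(history, n):
    """Baseline: predict the mean of past story points for all n issues."""
    return [mean(history)] * n


def practical_reduction(y_true, model_pred, history, threshold=0.5):
    """Compare model MAE against the historical-mean baseline MAE."""
    baseline_mae = mae(y_true, historical_mean_predictions(history, len(y_true)))
    model_mae = mae(y_true, model_pred)
    reduction = baseline_mae - model_mae
    return {
        "baseline_mae": baseline_mae,
        "model_mae": model_mae,
        "reduction": reduction,
        "meets_threshold": reduction >= threshold,  # >= 0.5 SP by default
    }
```

For example, with `history = [2, 3, 5, 8, 2]` the baseline predicts 4 SP everywhere. On true values `[1, 3, 5, 8]` that gives an MAE of 2.25 SP, so model predictions of `[1, 3, 5, 5]` (MAE 0.75 SP) clear the threshold with a reduction of 1.5 SP.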

Operational Variables

| Variable type | Variable | Value/Range |
| --- | --- | --- |
| Independent | LLM model | {llama3.2:8b, mistral:7b} |
| Independent | Prompting strategy | {zero-shot, few-shot-3, cot} |
| Dependent | MAE (story points) | ≥ 0.0 SP |
| Dependent | RMSE | ≥ 0.0 SP |
| Control | seed, temperature | 42, 0 |
| Control | Fibonacci scale | {1, 2, 3, 5, 8, 13, 21} |
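If model outputs are constrained to the Fibonacci scale, one way to enforce that control is to snap each raw numeric estimate to the nearest scale point. Whether PUMA post-processes predictions this way is not stated here, so the helper `snap_to_scale` and its tie-breaking rule are purely illustrative.

```python
FIBONACCI_SCALE = (1, 2, 3, 5, 8, 13, 21)


def snap_to_scale(value: float, scale=FIBONACCI_SCALE) -> int:
    """Return the scale point closest to `value`.

    Assumed tie-break: when two points are equally close, the smaller
    one is chosen (for example 4 maps to 3 rather than 5).
    """
    return min(scale, key=lambda point: (abs(point - value), point))
```

Values outside the scale are clamped to its ends, so 0 maps to 1 and 100 maps to 21.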

Reference Baselines

| Baseline | MAE | Source |
| --- | --- | --- |
| Historical mean per project | ~3.5 SP | TAWOS dataset statistics |
| Deep-SE | ~3.2 SP | Choetkiertikul et al. (2018) |
| CoGEE (GPT-4) | ~1.9 SP | Tawosi et al. (2024) |
| PUMA minimum (H2) | ≤ 3.0 SP | MVP threshold |
| PUMA desired | ≤ 1.5 SP | Competitive with CoGEE |

Statistical Analysis Protocol

Test: Wilcoxon Signed-Rank

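import numpy as np
from scipy import stats


def test_hypothesis(model_scores: list, baseline_scores: list,
                    alpha: float = 0.05) -> dict:
    """
    Two-sided Wilcoxon signed-rank test per PUMA protocol.

    Effect size r = |Z| / sqrt(N), where Z is recovered from the two-sided
    p-value through the normal approximation and N is the number of pairs.
    """
    stat, p_value = stats.wilcoxon(model_scores, baseline_scores,
                                   alternative='two-sided')
    n = len(model_scores)
    # |Z| implied by the two-sided p-value; isf(p/2) equals ppf(1 - p/2)
    # but stays accurate when p is very small.
    z = stats.norm.isf(p_value / 2)
    effect_r = abs(z) / np.sqrt(n)
    # Sign gives the direction: positive means model scores are higher.
    median_diff = float(np.median(np.subtract(model_scores, baseline_scores)))

    if effect_r < 0.1:
        effect_category = 'negligible'  # below the r >= 0.1 threshold
    elif effect_r < 0.3:
        effect_category = 'small'
    elif effect_r < 0.5:
        effect_category = 'medium'
    else:
        effect_category = 'large'

    return {
        'statistic': stat,
        'p_value': p_value,
        'effect_size_r': effect_r,
        'median_diff': median_diff,
        'is_significant': bool(p_value < alpha),
        'effect_category': effect_category,
        # The test is two-sided: check the sign of median_diff before
        # claiming the model is better than the baseline.
        'h0_rejected': bool(p_value < alpha and effect_r >= 0.1),
    }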
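To make the effect size formula r = Z / sqrt(N) concrete, the following dependency-free sketch computes the Wilcoxon W statistic and its normal-approximation Z by hand. It is an illustration under simplifying assumptions: zero differences are dropped, N is the number of non-zero pairs, and no tie or continuity correction is applied, so its Z can differ slightly from SciPy's. The function name `wilcoxon_effect_size` is hypothetical.

```python
import math


def wilcoxon_effect_size(x, y):
    """Return n, W, Z and r for paired samples x and y (illustrative only)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order, giving tied values their average rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation of W under H0 (no tie or continuity correction)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    return {"n": n, "W": w, "z": z, "r": abs(z) / math.sqrt(n)}
```

With very few pairs the normal approximation is rough, and r can exceed what the exact test would support, so small-N results are best read alongside the exact p-value.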

Reporting Standards (Kitchenham et al., 2002)

Every result table must report:

  • N (sample size)
  • Mean ± SD for both conditions
  • Wilcoxon W statistic
  • p-value (exact, not "> 0.05")
  • Effect size r with confidence interval
  • Practical significance threshold met? (Y/N)

Experiment Tracking

TABLE model, strategy, f1_macro, mae, wilcoxon_p, status
FROM "40 - Projects/PUMA/41.7 Experiments"
WHERE type = "experiment"
SORT stage ASC, f1_macro DESC

Experiment stages: EX-Stages-Overview

PM concepts (H1 + H2 targets): PN-IssueTriage-StoryPoints — F1-macro (H1) + MAE (H2) definitions · PN-LLM-Local-vs-Cloud — Local model constraints

Prompting strategies (independent variables): PN-CoT-FewShot-Prompting — CoT + Few-Shot theory · ST-Prompting-Strategies — Strategy cluster

Datasets: LN-Datasets-JiraSR-TAWOS — Jira SR (H1) + TAWOS (H2)

Statistical protocol: PN-Wilcoxon-FINER-Cornell-PRISMA — Wilcoxon + effect size · ST-Reproducibility-Cluster

Project integration: PR-PUMA-Ch3-Methods — Full methodology · PR-PUMA-Ch4-Results — Where results will go · SP-PUMA-Constitution — Falsifiability principle (Art. 3)

Navigation: MOC-PUMA-Master · MOC-LLM-Benchmarks-PM-AI