LN: Wohlin et al. (2012) — Experimentation in Software Engineering

Bibliographic Reference

Citation: Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2012). Experimentation in software engineering. Springer. https://doi.org/10.1007/978-3-642-29044-2


Pass 1 — Bird’s Eye View (5 Cs)

| C | Assessment |
|---|---|
| Category | Research methodology textbook |
| Context | The standard SE experimentation textbook; first published in 2000, revised in 2012. Widely used in academic SE research. |
| Correctness | Based on established experimental design and statistical theory; citations trace to Shadish, Cook & Campbell (validity framework) and Fisher (experimental design) |
| Contributions | (1) Taxonomy of SE study types (experiment, case study, survey); (2) Controlled experiment design (factor, treatment, response variable); (3) Validity threat taxonomy (internal, external, construct, conclusion); (4) Statistical analysis guidance (parametric vs. non-parametric) |
| Clarity | Very good: well structured, with SE-specific worked examples |

Relevance: ⭐⭐⭐⭐⭐

PUMA’s experimental design follows Wohlin et al. directly: a two-factor controlled experiment (model × prompting strategy), the Shapiro-Wilk test for normality, the Wilcoxon signed-rank test for non-parametric paired comparison, and the effect size r = Z/√N.


Pass 2 — Key Concepts

SE Study Types

| Type | Control | Manipulation | Observation |
|---|---|---|---|
| Experiment | Controlled | Yes | Yes |
| Case study | Uncontrolled | No | Yes |
| Survey | None | No | Yes |

PUMA is an experiment: controlled conditions (fixed dataset, seed=42, temperature=0), manipulated factors (model choice, prompting strategy), measured response variables (F1-macro, MAE).
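The two response variables can be computed without any library, which makes their definitions explicit. The following is a minimal sketch, not PUMA's actual evaluation code: the function names and the example labels are illustrative assumptions.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances cannot occur here,
        # but guard against division by zero anyway.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true, y_pred):
    """Mean of |true - predicted| over all items."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical priority labels and story-point estimates
priority_true = ["high", "high", "low", "medium"]
priority_pred = ["high", "low", "low", "medium"]
print(f1_macro(priority_true, priority_pred))
print(mean_absolute_error([3, 5, 8], [5, 5, 5]))
```

Because macro averaging weights every class equally, a rare priority class influences the score as much as a frequent one.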

Experimental Design for PUMA

Wohlin et al.’s factorial design maps directly:

| Concept | PUMA Variable |
|---|---|
| Factors | LLM Model × Prompting Strategy |
| Treatments | 4 models × 4 strategies = 16 conditions |
| Response variable | F1-macro (H1), MAE (H2) |
| Experimental units | Issues in dataset (n=200 stratified) |
| Blocking | Stratified sampling by priority class |
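The design above can be sketched in a few lines: cross the two factors to enumerate the 16 conditions, then draw a seeded sample stratified by priority class. This is an illustration under stated assumptions, not PUMA's code. The level names, the `priority` field, and proportional allocation are all hypothetical, since the note gives only the counts, the seed, and the stratification variable.

```python
import random
from collections import defaultdict
from itertools import product

# Placeholder level names: the note specifies only 4 models x 4 strategies.
MODELS = ["model_a", "model_b", "model_c", "model_d"]
STRATEGIES = ["strategy_1", "strategy_2", "strategy_3", "strategy_4"]


def build_conditions(models, strategies):
    """Full factorial crossing: one condition per (model, strategy) pair."""
    return list(product(models, strategies))


def stratified_sample(issues, n, seed=42):
    """Draw n issues with per-class quotas proportional to class size."""
    if n > len(issues):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    by_class = defaultdict(list)
    for issue in issues:
        by_class[issue["priority"]].append(issue)
    total = len(issues)
    # Largest-remainder allocation (integer arithmetic) so quotas sum to n
    quota = {c: n * len(v) // total for c, v in by_class.items()}
    remainder = {c: n * len(v) % total for c, v in by_class.items()}
    leftover = n - sum(quota.values())
    for c in sorted(by_class, key=lambda c: (-remainder[c], c))[:leftover]:
        quota[c] += 1
    sample = []
    for c in sorted(by_class):
        sample.extend(rng.sample(by_class[c], quota[c]))
    return sample


conditions = build_conditions(MODELS, STRATEGIES)
# Synthetic population with an assumed 200/300/500 class split
labels = ["high"] * 200 + ["medium"] * 300 + ["low"] * 500
issues = [{"id": i, "priority": p} for i, p in enumerate(labels)]
sample = stratified_sample(issues, n=200)
print(len(conditions), len(sample))  # 16 200
```

Applying every condition to the same sampled issues is what makes the later comparisons paired.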

Validity Threat Framework (PUMA mapping)

| Threat Type | Definition | PUMA Mitigation |
|---|---|---|
| Internal validity | Confounds affect results | Fixed seed=42, temperature=0, identical prompts |
| External validity | Results don’t generalise | Open datasets (Jira SR, TAWOS); reproducibility protocol |
| Construct validity | Metrics don’t measure what we claim | F1-macro validated against expert labels; MAE against human SP estimates |
| Conclusion validity | Statistical analysis errors | Wilcoxon signed-rank test with Benjamini-Hochberg FDR (BH-FDR) correction; effect size reported |
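The BH-FDR correction listed under conclusion validity matters because a 16-condition design produces many pairwise comparisons. Below is a minimal sketch of the Benjamini-Hochberg step-up adjustment; the function name and the example p-values are illustrative and not taken from PUMA.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags), both in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted, [q <= alpha for q in adjusted]


# Made-up p-values from five hypothetical comparisons
adjusted, reject = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
print(reject)  # [True, True, False, False, False]
```

In this example the third and fourth comparisons are significant before correction (p < 0.05) but not after it.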

Non-Parametric Test Selection

Wohlin et al. recommend:

  1. Test normality with Shapiro-Wilk (α=0.05)
  2. If normal: paired t-test
  3. If not normal: Wilcoxon signed-rank test (PUMA’s choice — expected for F1 distributions)

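The Wilcoxon signed-rank test does not assume normality. It operates on the ranks of the paired differences and assumes only that those differences are distributed symmetrically about their median. This makes it appropriate for bounded metrics such as F1, which lies in [0, 1] and tends to cluster near 1.0.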
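To make the test and the effect size r = Z/√N concrete, here is a minimal stdlib-only sketch of the Wilcoxon signed-rank test using the normal approximation. It rests on several assumptions: zero differences are dropped, N is taken as the number of non-zero pairs (the note does not say which N is meant), and no continuity or tie correction is applied. For real analyses a statistics library with exact small-sample p-values is preferable.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (w_plus, z, p_value, r), where r = |z| / sqrt(n) and n is the
    number of non-zero paired differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    r = abs(z) / math.sqrt(n)
    return w_plus, z, p_value, r


# Made-up paired scores for two conditions on the same eight items
cond_a = [10, 12, 9, 15, 11, 14, 13, 16]
cond_b = [8, 13, 5, 10, 11, 11, 7, 8]
w_plus, z, p_value, r = wilcoxon_signed_rank(cond_a, cond_b)
print(w_plus, round(z, 3), round(p_value, 3), round(r, 3))
```

With only a handful of pairs, as in this example, the normal approximation is crude, so the printed p-value is indicative only.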


PUMA Integration

  • Ch. 3 Methods, Experimental Design: Cite Wohlin et al. for the factorial design and the validity threat analysis.
  • Statistical Analysis: Cite Wohlin et al. to justify choosing the Wilcoxon signed-rank test over the paired t-test.

MOCs