LN: Wohlin et al. (2012) — Experimentation in Software Engineering
Bibliographic Reference
Citation: Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., & Wesslén, A. (2012). Experimentation in software engineering. Springer. https://doi.org/10.1007/978-3-642-29044-2
Pass 1 — Bird’s Eye View (5 Cs)
| C | Assessment |
|---|---|
| Category | Research methodology textbook |
| Context | The standard SE experimentation textbook; originally published in 2000, revised 2012. Used across academic SE research |
| Correctness | Based on established experimental design and statistical theory; citations trace to Shadish, Cook & Campbell (validity framework) and Fisher (experimental design) |
| Contributions | (1) Taxonomy of SE study types (experiment, case study, survey); (2) Controlled experiment design (factor, treatment, response variable); (3) Validity threat taxonomy (internal, external, construct, conclusion); (4) Statistical analysis guidance (parametric vs. non-parametric) |
| Clarity | Very good — structured, with SE-specific worked examples |
Relevance: ⭐⭐⭐⭐⭐
PUMA’s experimental design follows Wohlin et al. directly: 2-factor controlled experiment (model × prompting strategy), Wilcoxon signed-rank for non-parametric comparison, Shapiro-Wilk for normality testing, and effect size r = Z/√N.
Pass 2 — Key Concepts
SE Study Types
| Type | Control | Manipulation | Observation |
|---|---|---|---|
| Experiment | Controlled | Yes | Yes |
| Case study | Uncontrolled | No | Yes |
| Survey | None | No | Yes |
PUMA is an experiment: controlled conditions (fixed dataset, seed=42, temperature=0), manipulated factors (model choice, prompting strategy), measured response variables (F1-macro, MAE).
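As a reminder of what the two response variables measure, here is a minimal pure-Python sketch. F1-macro is the unweighted mean of the per-class F1 scores, and MAE is the mean absolute difference between predicted and reference values. The function names and the toy labels are illustrative assumptions, not PUMA's actual code.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted numeric values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): priority labels and story-point estimates
priorities_true = ["high", "high", "low", "medium"]
priorities_pred = ["high", "low", "low", "medium"]
print(f1_macro(priorities_true, priorities_pred))
print(mae([3, 5, 8], [2, 5, 13]))
```

Because every class contributes equally to the macro average, a model that ignores a rare priority class is penalised as heavily as one that fails on a frequent class.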
Experimental Design for PUMA
Wohlin et al.’s factorial design concepts map directly onto PUMA:
| Concept | PUMA Variable |
|---|---|
| Factors | LLM Model × Prompting Strategy |
| Treatments | 4 models × 4 strategies = 16 conditions |
| Response variable | F1-macro (H1), MAE (H2) |
| Experimental units | Issues in the dataset (n=200, stratified sample) |
| Blocking | Stratified sampling by priority class |
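The mapping in the table can be sketched in a few lines of Python: the full factorial design is the Cartesian product of the two factors, and the experimental units come from a seeded stratified sample. The level names (`model_a`, `strategy_1`, ...), the `priority` field, and the proportional allocation rule are placeholders assumed for illustration; the note only fixes 4 × 4 conditions, n=200, and seed=42.

```python
import random
from itertools import product

# Hypothetical level names; the note only states 4 models x 4 strategies.
MODELS = ["model_a", "model_b", "model_c", "model_d"]
STRATEGIES = ["strategy_1", "strategy_2", "strategy_3", "strategy_4"]


def conditions(models=MODELS, strategies=STRATEGIES):
    """Full factorial design: every model paired with every strategy."""
    return list(product(models, strategies))


def stratified_sample(issues, key, n, seed=42):
    """Draw roughly n issues, allocating to each stratum in proportion to its size."""
    rng = random.Random(seed)  # local generator, so the draw is reproducible
    strata = {}
    for issue in issues:
        strata.setdefault(key(issue), []).append(issue)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Proportional allocation; rounding can shift the total by a unit or two
        k = min(len(members), round(n * len(members) / len(issues)))
        sample.extend(rng.sample(members, k))
    return sample


print(len(conditions()))  # one entry per (model, strategy) pair
```

Every one of the 16 conditions would then be run on the same sampled issues, which is what makes the later comparisons paired.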
Validity Threat Framework (PUMA mapping)
| Threat Type | Definition | PUMA Mitigation |
|---|---|---|
| Internal validity | Confounds affect results | Fixed seed=42, temperature=0, identical prompts |
| External validity | Results don’t generalise | Open dataset (Jira SR, TAWOS); reproducibility protocol |
| Construct validity | Metrics don’t measure what we claim | F1-macro validated against expert labels; MAE against human SP estimates |
| Conclusion validity | Statistical analysis errors | Wilcoxon + BH-FDR correction; effect size reported |
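For the conclusion-validity row, the Benjamini-Hochberg (BH) step-up procedure can be sketched as follows. With m p-values sorted in ascending order, it finds the largest rank k such that p(k) ≤ (k/m)·q and rejects the k hypotheses with the smallest p-values. This is a plain-Python illustration under the assumption of a target FDR level q=0.05; the function name is hypothetical and the p-values are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    # Step-up: find the largest 1-based rank k with p_(k) <= (k / m) * q
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= max_k
    return reject


# Made-up p-values from eight pairwise comparisons
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```

Note that a hypothesis can be rejected even if its own p-value exceeds its own threshold, as long as a larger-ranked p-value passes; that is the "step-up" part of the procedure.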
Non-Parametric Test Selection
Wohlin et al. recommend:
- Test normality with Shapiro-Wilk (α=0.05)
- If normal: paired t-test
- If not normal: Wilcoxon signed-rank test (PUMA’s choice, since F1 distributions are expected to be non-normal)
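The Wilcoxon signed-rank test does not assume normality; it assumes only that the paired differences are symmetric about their median. This makes it appropriate for bounded metrics such as F1 in [0, 1], whose scores tend to cluster near 1.0 and are therefore skewed.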
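To make the statistic and the effect size r = Z/√N concrete, here is a minimal pure-Python sketch of the signed-rank test using the large-sample normal approximation. Several choices here are assumptions rather than something the note specifies: zero differences are dropped, tied absolute differences receive average ranks, no tie or continuity correction is applied to the variance, and N is taken as the number of non-zero pairs (some authors use the total number of observations instead). In practice a library routine such as `scipy.stats.wilcoxon` would normally be used; the function name below is hypothetical.

```python
import math


def wilcoxon_z_and_r(x, y):
    """Signed-rank Z (normal approximation) and effect size r = |Z| / sqrt(N)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order, giving tied values their average rank.
    # Ties are detected by exact equality; round float data first if needed.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)  # sum of positive ranks
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return z, abs(z) / math.sqrt(n)


# Made-up paired scores for two conditions evaluated on the same five units
z, r = wilcoxon_z_and_r([10, 12, 14, 16, 18], [9, 10, 11, 12, 13])
print(z, r)
```

The normal approximation is rough for very small samples such as this toy example; with n=200 paired units it is a reasonable basis for reporting Z and r.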
PUMA Integration
- Ch.3 Methods — Experimental Design: Cite Wohlin et al. for the factorial design and validity threat analysis
- Statistical Analysis: Cite Wohlin et al. to justify choosing the Wilcoxon signed-rank test over the paired t-test
Related Notes
- PN-StatisticalValidation-Full — full Wilcoxon pipeline with Python code
- PN-Wilcoxon-FINER-Cornell-PRISMA — Wilcoxon + FINER + PRISMA integration
- LN-Hevner-2004-DSR — DSR as the overarching paradigm
- EX-Hypotheses-H1-H2 — PUMA’s hypotheses framed per Wohlin