Wilcoxon Signed-Rank Test

Atomic Claim

Overview

The Wilcoxon signed-rank test is the appropriate non-parametric alternative to the paired t-test for comparing LLM condition scores against baseline scores on the same items, because it does not require the paired differences to be normally distributed and F1 scores are not assumed to be normal.

The Test in PUMA

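from scipy import stats
import numpy as np

def wilcoxon_test(llm_scores: list, baseline_scores: list, alpha: float = 0.05) -> dict:
    """
    Two-sided Wilcoxon signed-rank test on paired scores.
    llm_scores and baseline_scores: per-issue F1 (triage) or per-story |error| (estimation).
    """
    if len(llm_scores) != len(baseline_scores):
        raise ValueError("llm_scores and baseline_scores must be paired (same length)")
    stat, p = stats.wilcoxon(llm_scores, baseline_scores, alternative='two-sided')
    N = len(llm_scores)
    # Effect size r = |Z| / sqrt(N), with Z recovered from the two-sided p-value.
    # norm.isf(p / 2) equals norm.ppf(1 - p / 2) but keeps precision for very small p.
    z = stats.norm.isf(p / 2) if p > 0 else float('inf')
    r = abs(z) / np.sqrt(N)
    return {
        'W': stat, 'p_value': p,
        'effect_r': round(r, 3),
        'significant': bool(p < alpha),
        'effect_category': 'small' if r < 0.3 else 'medium' if r < 0.5 else 'large',
        # H0 is rejected only if the result is both statistically and practically significant.
        'h0_rejected': bool(p < alpha and r >= 0.1)
    }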
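
The effect size used in the function above can be written out explicitly. The following is a sketch of the arithmetic only: it assumes, as the code does, that Z is recovered from the two-sided p-value and that N is the number of pairs. The values p = .01, N = 100, and N = 900 are illustrative, not PUMA results.

```latex
\[
  Z = \Phi^{-1}\!\left(1 - \frac{p}{2}\right), \qquad
  r = \frac{|Z|}{\sqrt{N}}
\]
% Illustrative values: p = .01 gives Z = \Phi^{-1}(0.995) \approx 2.576.
\[
  N = 100:\; r \approx \frac{2.576}{10} = 0.258
  \qquad
  N = 900:\; r \approx \frac{2.576}{30} = 0.086
\]
```

With the same p-value, the larger sample falls below the r ≥ 0.1 threshold, so the function would report the result as significant but would not reject H₀.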

Reporting Template (H1)

“The Wilcoxon signed-rank test revealed that [model+strategy] (Mdn=X) performed significantly [better/worse] than the heuristic baseline (Mdn=Y), W=Z, p=.0XX, r=.XX. This constitutes a [small/medium/large] effect.”
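
The template can be filled in programmatically from the dictionary returned by `wilcoxon_test`. The sketch below is one possible way to do that, not part of the PUMA code: the function name `format_h1_report`, the `higher_is_better` flag, and the two-decimal rounding of medians are assumptions made for illustration.

```python
from statistics import median


def _apa(value: float, digits: int) -> str:
    """Format a value in [0, 1) without the leading zero, e.g. 0.032 -> '.032'."""
    text = f"{value:.{digits}f}"
    return text[1:] if text.startswith("0.") else text


def format_h1_report(label, llm_scores, baseline_scores, result, higher_is_better=True):
    """Fill the H1 reporting template from a wilcoxon_test-style result dict.

    higher_is_better: True for F1, False for absolute error (assumed convention).
    """
    if not result["h0_rejected"]:
        # The template wording ("significantly") only applies to rejected H0.
        raise ValueError("template applies only when H0 was rejected")
    mdn_llm = median(llm_scores)
    mdn_base = median(baseline_scores)
    llm_wins = (mdn_llm > mdn_base) == higher_is_better
    direction = "better" if llm_wins else "worse"
    p = result["p_value"]
    p_text = "p<.001" if p < 0.001 else f"p={_apa(p, 3)}"
    return (
        f"The Wilcoxon signed-rank test revealed that {label} (Mdn={mdn_llm:.2f}) "
        f"performed significantly {direction} than the heuristic baseline "
        f"(Mdn={mdn_base:.2f}), W={result['W']:g}, {p_text}, "
        f"r={_apa(result['effect_r'], 2)}. "
        f"This constitutes a {result['effect_category']} effect."
    )
```

Passing `higher_is_better=False` covers the estimation task, where a lower median absolute error counts as better.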

Key Choices Justified

Choice	Justification
Wilcoxon (not t-test)	F1 scores are not assumed to be normally distributed
Two-sided	H₀ does not assume direction
α = 0.05	Standard in SE research
r ≥ 0.1 for H₀ rejection	Minimum practical significance (Kitchenham 2002)

🔗 EX-Hypotheses-H1-H2 · PN-DSR-SLR-Methods · PR-PUMA-Ch3-Methods (§3.7) · PN-IssueTriage-StoryPoints (F1-macro, MAE metrics)

id: PN-FINER-Criteria
title: "FINER Criteria for Research Questions"
type: permanent-note
category: method
tags: [permanent, method, finer, research-question, feasibility]
created: 2026-03-01
maturity: evergreen

FINER Criteria for Research Questions

Atomic Claim

A well-formed research question must satisfy all five FINER criteria simultaneously — Feasible, Interesting, Novel, Ethical, Relevant — to constitute a valid academic contribution.

The Framework

Criterion	Question	PUMA Assessment
Feasible	Can this be done with available resources?	✅ Local compute + public datasets + 6 months
Interesting	Does the community care?	✅ PM+AI is a top-10 CS research growth area
Novel	Does existing work not answer this?	✅ No open PM+LLM+CodeCarbon benchmark exists
Ethical	Does it respect privacy, fairness, consent?	✅ Public data, open models, HITL design
Relevant	Does it matter for practice?	✅ ICT PM is a billion-dollar industry

🔗 PN-DSR-SLR-Methods · EX-Hypotheses-H1-H2 · PR-PUMA-Ch1-Introduction (§1.2 RQ validation)

id: PN-Cornell-Method
title: "Cornell Note-Taking Method"
type: permanent-note
category: method
tags: [permanent, method, cornell, note-taking, learning, mit-student-method]
aliases: ["Cornell Notes", "Cornell System"]
created: 2026-03-01
maturity: evergreen

Cornell Note-Taking Method

Atomic Claim

The Cornell system forces active processing of information by physically separating questions/cues (left) from notes (right) and demanding synthesis (bottom) — making it the structural backbone of the MIT Student Method reading process.

Layout

┌─────────────────────────────────────────────────────────┐
│ Note title: [Paper/Chapter]          Date: [date]       │
├──────────────────┬──────────────────────────────────────┤
│  CUE COLUMN      │  NOTES COLUMN                        │
│  (30% width)     │  (70% width)                         │
│                  │                                      │
│  Q: Main claim?  │  [Your notes in your own words]      │
│  Q: Falsifiable? │  [Evidence, examples, data]          │
│  Key: F1-macro   │  [Definitions as used in this paper] │
│  ??              │  [Connections to other notes]        │
│                  │                                      │
├──────────────────┴──────────────────────────────────────┤
│  SUMMARY (20% height — written AFTER reading)           │
│  3 sentences: core claim + evidence + PUMA implication  │
└─────────────────────────────────────────────────────────┘

Rules for Academic Use

Never copy sentences verbatim — paraphrase everything (enforces understanding)
Left column during reading — write questions as they arise, fill answers after
Summary after reading — synthesise without looking at notes first
Review within 24h — cover notes, recite from cues only

In this vault: Applied in Template-Literature-Note-Paper (Cornell Notes section)

🔗 PN-MIT-Student-Method

id: PN-PRISMA-DFLLM
title: "PRISMA-DFLLM & PRISMA-trAIce: AI-Extended PRISMA"
type: permanent-note
category: framework
tags: [permanent, framework, prisma, ai-use, transparency, slr-extension]
aliases: ["PRISMA-DFLLM", "PRISMA-trAIce", "AI-augmented SLR"]
created: 2026-03-01
maturity: growing

PRISMA-DFLLM & PRISMA-trAIce

Atomic Claim

When AI tools assist in SLR screening or synthesis, standard PRISMA 2020 reporting must be extended to document: which tasks were AI-assisted, with which models, what outputs were generated, and what human validation was applied — enabling reproducibility of the review process itself.

PRISMA-DFLLM Extension

Adds AI-specific reporting items to standard PRISMA flow:

PRISMA Stage	AI Extension Required
Identification	Document any AI-assisted search query generation
Screening	For each AI-assisted decision: tool, model, prompt, human override rate
Extraction	For AI-extracted data: what was extracted, how validated
Synthesis	For AI-assisted synthesis: what was generated, how rewritten

PRISMA-trAIce Logging Format (PUMA)

Every AI-assisted screening decision requires this record in PRISMA-Log:

Paper: [Zotero key]
AI tool: [Elicit | Claude | Perplexity]
AI model: [version if known]
AI suggestion: [Include | Exclude | Uncertain]
AI reasoning (brief): [what the AI cited]
Human decision: [Include | Exclude]
Agreement: [Y/N]
If disagreement: [explain why human overruled]
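
The record above maps naturally onto a small data structure, which also makes the "human override rate" required at the Screening stage easy to compute. The following is a minimal sketch, not the PUMA implementation: the class name `ScreeningRecord`, the function `human_override_rate`, and the choice to count an "Uncertain" AI suggestion as a disagreement are all assumptions, and the Zotero keys in the usage lines are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ScreeningRecord:
    """One AI-assisted screening decision, mirroring the PRISMA-Log fields."""
    paper: str            # Zotero key
    ai_tool: str          # e.g. "Elicit", "Claude", "Perplexity"
    ai_model: str         # version if known
    ai_suggestion: str    # "Include" | "Exclude" | "Uncertain"
    ai_reasoning: str     # brief: what the AI cited
    human_decision: str   # "Include" | "Exclude"
    override_reason: str = ""

    def __post_init__(self):
        if self.ai_suggestion not in {"Include", "Exclude", "Uncertain"}:
            raise ValueError(f"unknown AI suggestion: {self.ai_suggestion}")
        if self.human_decision not in {"Include", "Exclude"}:
            raise ValueError(f"unknown human decision: {self.human_decision}")
        if not self.agreement and not self.override_reason:
            raise ValueError("a disagreement must explain why the human overruled")

    @property
    def agreement(self) -> bool:
        # Assumption: "Uncertain" never matches a human decision, so it counts as N.
        return self.ai_suggestion == self.human_decision

    def to_log_entry(self) -> str:
        """Render the record in the logging format shown above."""
        lines = [
            f"Paper: {self.paper}",
            f"AI tool: {self.ai_tool}",
            f"AI model: {self.ai_model}",
            f"AI suggestion: {self.ai_suggestion}",
            f"AI reasoning (brief): {self.ai_reasoning}",
            f"Human decision: {self.human_decision}",
            f"Agreement: {'Y' if self.agreement else 'N'}",
        ]
        if not self.agreement:
            lines.append(f"If disagreement: {self.override_reason}")
        return "\n".join(lines)


def human_override_rate(records: list) -> float:
    """Share of records in which the human decision differed from the AI suggestion."""
    if not records:
        raise ValueError("no screening records")
    return sum(not r.agreement for r in records) / len(records)


# Usage with placeholder keys (not real Zotero entries).
records = [
    ScreeningRecord("KEY0001", "Claude", "unknown", "Include", "matches scope", "Include"),
    ScreeningRecord("KEY0002", "Elicit", "unknown", "Include", "mentions LLMs",
                    "Exclude", "no empirical evaluation"),
]
print(records[1].to_log_entry())
print(human_override_rate(records))
```

Validating the disagreement explanation at construction time keeps incomplete records out of the log instead of discovering them at reporting time.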

Implementation: PRISMA-Log · WF-SLR-Pipeline

🔗 PN-DSR-SLR-Methods · AI-Use-Log

id: PN-Contextual-Anchoring
title: "Contextual Anchoring: Prompt Drift Prevention"
type: permanent-note
category: framework
tags: [permanent, framework, prompting, contextual-anchoring, drift, long-context]
aliases: ["Contextual Anchoring", "Prompt Anchoring", "Context Anchoring"]
created: 2026-03-01
maturity: seedling

Contextual Anchoring

Atomic Claim

Re-stating critical constraints at the end of a long prompt (the “anchor”) can reduce model drift away from task requirements, because transformer language models tend to weight recent context more heavily (recency bias).

Why It Works

Transformer attention has a mild recency bias — tokens near the end of the context window often receive more attention. By restating constraints as “anchoring reminders” at the end of a long prompt, you reinforce what matters most at the most-attended position.

Pattern

[Long prompt body — role, context, objective, instructions]

--- ANCHOR ---
Before generating: Remember that your response must:
- [Constraint 1 — most important]
- [Constraint 2]
- [Format requirement]
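
The pattern above can be applied mechanically when prompts are assembled in code. The sketch below is one possible helper, assuming constraints are passed most-important first; the function name `build_anchored_prompt` and the example prompt text are hypothetical.

```python
def build_anchored_prompt(body: str, constraints: list) -> str:
    """Append an anchor block that restates the constraints after the prompt body."""
    if not constraints:
        raise ValueError("an anchor needs at least one constraint")
    anchor = [
        "--- ANCHOR ---",
        "Before generating: Remember that your response must:",
    ]
    # One bullet per constraint, in the order given (most important first).
    anchor.extend(f"- {constraint}" for constraint in constraints)
    return "\n".join([body.rstrip(), "", *anchor])


# Usage: a label-only experiment prompt (illustrative wording).
prompt = build_anchored_prompt(
    "You are an issue triage assistant. Classify the issue below.\n",
    ["Return only the label", "Use one of: bug, feature, question"],
)
print(prompt)
```

Keeping the constraints in a list means the same items can be stated once in the prompt body and restated in the anchor without the two copies diverging.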

PUMA Use Cases

Long EGI panoramic mapping prompts (prevents scope creep)
Multi-turn research discussions (prevents drift from initial PUMA context)
Experiment prompt templates (prevents models from adding explanation to label-only responses)

Prompt example: PT-Advanced-Prompts-IIPR-Anchoring-AgentOS

🔗 PN-RCOIF-Framework · PN-CoT-FewShot-Prompting
