Scenarios Reference

This document describes the three benchmark scenarios implemented in PUMA.

triage_jira

Class: puma.scenarios.triage_jira.TriageJiraScenario
Task type: Multi-class classification
Scenario ID: triage_jira

Task definition

Given a Jira issue with a title and description, assign one of four priority labels:

Label	Meaning
`Critical`	Blocks work; requires immediate attention
`Major`	Significant impact; should be resolved soon
`Minor`	Low impact; addressed in normal flow
`Trivial`	Cosmetic or negligible; lowest priority

Dataset

Property	Value
File	`data/jira_balanced_200.csv`
Rows	200 (50 per class, balanced)
Columns used	`issue_key`, `title`, `description`, `priority`
Gold column	`priority`

Parse logic

import re

_ANSWER_RE = re.compile(
    r"\b(Critical|Major|Minor|Trivial)\b", re.IGNORECASE
)
# First match in the response is used; case-normalised to title case.
# Returns None if no match found.
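
The behaviour described in the comments above can be sketched as a small function. This is a minimal illustration, not the actual `parse_response` implementation; the function name `parse_priority` is hypothetical.

```python
from __future__ import annotations

import re

_ANSWER_RE = re.compile(r"\b(Critical|Major|Minor|Trivial)\b", re.IGNORECASE)


def parse_priority(response: str) -> str | None:
    """Return the first priority label in the response, or None if there is none."""
    match = _ANSWER_RE.search(response)
    if match is None:
        return None
    # Normalise "MAJOR" / "major" to title case: "Major".
    return match.group(1).title()
```

Because the first match wins, a response such as "not Critical, rather Minor" parses as `Critical`. Prompts that ask the model to answer with the label only reduce this risk.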

Metrics

Metric	Notes
`f1_macro`	Primary metric; equal weight per class
`f1_weighted`	Weighted by class support
`accuracy`	Overall correct fraction
`per_class.<label>.precision`	Per-label
`per_class.<label>.recall`	Per-label
`per_class.<label>.f1`	Per-label
`parse_failure_rate`	Fraction where `parse_response` returned `None`
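
To make the relationship between these metrics concrete, here is a dependency-free sketch. It assumes that a parse failure (`None`) counts as an incorrect prediction for every class; PUMA's actual handling of `None` is not stated here, and the function name `classification_metrics` is hypothetical.

```python
from __future__ import annotations

LABELS = ["Critical", "Major", "Minor", "Trivial"]


def classification_metrics(gold: list[str], pred: list[str | None]) -> dict[str, float]:
    """Compute macro/weighted F1, accuracy and parse-failure rate."""
    n = len(gold)
    f1_by_label: dict[str, float] = {}
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_by_label[label] = 2 * precision * recall / denom if denom else 0.0
    support = {label: gold.count(label) for label in LABELS}
    return {
        # Macro: every class contributes equally, regardless of support.
        "f1_macro": sum(f1_by_label.values()) / len(LABELS),
        # Weighted: each class contributes in proportion to its gold count.
        "f1_weighted": sum(f1_by_label[l] * support[l] for l in LABELS) / n,
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / n,
        "parse_failure_rate": sum(p is None for p in pred) / n,
    }
```

On a perfectly balanced sample, `f1_macro` and `f1_weighted` coincide; they diverge only when class supports differ, for example after subsampling.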

Example run-spec

id: triage_zeroshot
scenario: triage_jira
sample_size: 50
models: [qwen2.5:3b]
adaptation:
  strategy: [zero-shot]
inference:
  temperature: 0.0
  seed: 42
metrics: [f1_macro]

estimation_tawos

Class: puma.scenarios.estimation_tawos.EstimationTawosScenario
Task type: Regression (ordinal Fibonacci)
Scenario ID: estimation_tawos

Task definition

Given a user story title and description, predict the story points as a Fibonacci number.

Valid values: {1, 2, 3, 5, 8, 13, 21, 34, 55, 89}

Dataset

Property	Value
File	`data/tawos_clean.csv`
Rows	9 020 agile backlog items
Source	TAWOS (open-source agile dataset, SOLAR group)
Columns used	`item_id`, `title`, `description`, `story_points`
Gold column	`story_points`

Parse logic

# 1. Strip punctuation from the response
# 2. Extract the first numeric substring; return None if no number is found
# 3. If the numeric value is in FIBONACCI_SERIES → return it
# 4. Else if |closest_fibonacci - value| ≤ 1 → snap to closest_fibonacci and return it
# 5. Otherwise return the raw float value

Fibonacci series: {1, 2, 3, 5, 8, 13, 21, 34, 55, 89}
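
The parse steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the name `parse_story_points` is hypothetical, the regex extracts the number directly instead of stripping punctuation first (so a decimal such as `3.5` survives), and ties between two equally close Fibonacci values resolve to the smaller one.

```python
from __future__ import annotations

import re

FIBONACCI_SERIES = (1, 2, 3, 5, 8, 13, 21, 34, 55, 89)
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")  # first integer or decimal


def parse_story_points(response: str) -> int | float | None:
    """Parse a story-point estimate, snapping near-misses to the series."""
    match = _NUMBER_RE.search(response)
    if match is None:
        return None  # no number anywhere in the response
    value = float(match.group(0))
    if value in FIBONACCI_SERIES:
        return int(value)
    # min() returns the first minimum, so ties go to the smaller value.
    closest = min(FIBONACCI_SERIES, key=lambda fib: abs(fib - value))
    if abs(closest - value) <= 1:
        return closest
    return value  # off-series value, kept as a raw float
```

For example, `7` snaps to `8`, while `10` is two away from the nearest series member and is returned unchanged as `10.0`.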

Metrics

Metric	Notes
`mae`	Mean Absolute Error — primary metric
`mdae`	Median Absolute Error — robust to outliers
`rmse`	Root Mean Squared Error
`mae_by_bin`	MAE broken down by SP range (1–3, 5–8, 13–21, 34+)
`parse_failure_rate`	Fraction where response contained no parseable number
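
As a reference for how the three error metrics differ, here is a standard-library sketch. It assumes that instances with parse failures have already been dropped before scoring, which may not match PUMA's behaviour; the function name `regression_metrics` is hypothetical.

```python
from __future__ import annotations

import math
import statistics


def regression_metrics(gold: list[float], pred: list[float]) -> dict[str, float]:
    """Compute MAE, MdAE and RMSE over paired gold/predicted story points."""
    errors = [abs(g - p) for g, p in zip(gold, pred)]
    return {
        "mae": statistics.fmean(errors),
        # The median ignores how large the worst errors are.
        "mdae": statistics.median(errors),
        # Squaring penalises large errors more heavily than MAE does.
        "rmse": math.sqrt(statistics.fmean(e * e for e in errors)),
    }
```

A single large miss raises `rmse` more than `mae` and may leave `mdae` unchanged, which is why the three are reported together.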

Story-point bins

Bin ID	Range	Interpretation
`1-3`	1 to 3	Small stories
`5-8`	5 to 8	Medium stories
`13-21`	13 to 21	Large stories
`34+`	34 and above	Extra-large / epics
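
A sketch of how `mae_by_bin` could be derived from the bins in the table above. It assumes that instances are binned by their gold story points (not by the prediction) and that values outside every bin are ignored; the helper names `bin_for` and `mae_by_bin` are hypothetical.

```python
from __future__ import annotations

# (bin ID, inclusive lower bound, inclusive upper bound or None if open-ended)
BINS = [("1-3", 1, 3), ("5-8", 5, 8), ("13-21", 13, 21), ("34+", 34, None)]


def bin_for(story_points: float) -> str | None:
    """Return the bin ID for a story-point value, or None if it fits no bin."""
    for bin_id, low, high in BINS:
        if story_points >= low and (high is None or story_points <= high):
            return bin_id
    return None


def mae_by_bin(gold: list[float], pred: list[float]) -> dict[str, float]:
    """Mean absolute error per bin, grouped by the gold value."""
    errors: dict[str, list[float]] = {}
    for g, p in zip(gold, pred):
        bin_id = bin_for(g)
        if bin_id is not None:
            errors.setdefault(bin_id, []).append(abs(g - p))
    return {bin_id: sum(errs) / len(errs) for bin_id, errs in errors.items()}
```

Bins with no instances in the sample are simply absent from the result, so small `sample_size` values may not cover `34+` at all.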

Example run-spec

id: estimation_zeroshot
scenario: estimation_tawos
sample_size: 50
models: [qwen2.5:3b]
adaptation:
  strategy: [zero-shot, few-shot-3]
inference:
  temperature: 0.0
  seed: 42
metrics: [mae]

prioritization_jira

Class: puma.scenarios.prioritization_jira.PrioritizationJiraScenario
Task type: Binary classification (pairwise ranking)
Scenario ID: prioritization_jira

Task definition

Given two Jira issues A and B, determine which has higher priority. The model must output A or B.

Pair construction

Pairs are sampled from the Jira dataset. The gold label is determined by the priority order:

Critical > Major > Minor > Trivial

If both issues share the same priority, the pair is skipped to ensure unambiguous gold labels.
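
The sampling rule can be sketched as below. This is an assumption-laden illustration: the function name `build_pairs`, the dict-based issue representation, and the retry cap are all hypothetical, and the real sampler may draw pairs differently.

```python
from __future__ import annotations

import random

PRIORITY_RANK = {"Critical": 3, "Major": 2, "Minor": 1, "Trivial": 0}


def build_pairs(issues: list[dict], n_pairs: int, seed: int = 42) -> list[dict]:
    """Sample up to n_pairs issue pairs with distinct priorities."""
    rng = random.Random(seed)  # local RNG keeps sampling reproducible
    pairs: list[dict] = []
    # Cap attempts so a dataset with a single priority cannot loop forever.
    for _ in range(100 * n_pairs):
        if len(pairs) == n_pairs:
            break
        a, b = rng.sample(issues, 2)
        rank_a = PRIORITY_RANK[a["priority"]]
        rank_b = PRIORITY_RANK[b["priority"]]
        if rank_a == rank_b:
            continue  # same priority: gold label would be ambiguous
        gold = "A" if rank_a > rank_b else "B"
        pairs.append({"A": a, "B": b, "higher_priority": gold})
    return pairs
```

Since the two issues are drawn in random order, the gold label is not biased toward position A or B.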

Dataset

Property	Value
Base file	`data/jira_balanced_200.csv`
Sampling	Random pairs (seed-controlled)
Columns used	`issue_key`, `title`, `description`, `priority`
Gold column	`higher_priority` (`A` or `B`)

Parse logic

import re

_ANSWER_RE = re.compile(r"\b([AB])\b", re.IGNORECASE)
# First match (A or B) in the response is used; uppercased.
# Returns None if no match found.
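
A minimal sketch of a parser built on this pattern, useful for checking edge cases. The function name `parse_choice` is hypothetical and this is not the actual `parse_response` implementation.

```python
from __future__ import annotations

import re

_ANSWER_RE = re.compile(r"\b([AB])\b", re.IGNORECASE)


def parse_choice(response: str) -> str | None:
    """Return "A" or "B" for the first standalone a/b token, else None."""
    match = _ANSWER_RE.search(response)
    return match.group(1).upper() if match else None
```

One consequence of `re.IGNORECASE` is that the English article "a" counts as a match, so a verbose response such as "I think a closer look favours B" parses as `A`. Short, constrained outputs avoid this.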

Metrics

Metric	Notes
`accuracy`	Primary metric — fraction of correct A/B predictions
`parse_failure_rate`	Fraction where response contained neither `A` nor `B`

Example run-spec

id: prioritization_zeroshot
scenario: prioritization_jira
sample_size: 40
models: [qwen2.5:3b]
adaptation:
  strategy: [zero-shot]
inference:
  temperature: 0.0
  seed: 42
metrics: [accuracy]

Common Scenario Interface

All scenarios implement puma.scenarios.base.Scenario:

from abc import ABC

import pandas as pd


class Scenario(ABC):
    name: str           # scenario ID used in run-specs
    task_type: str      # "classification" | "regression"

    def sample(self, n: int, seed: int = 42) -> pd.DataFrame:
        """Return n instances from the dataset."""

    def gold_label(self, instance: dict) -> str:
        """Extract the ground-truth label from a row dict."""

    def parse_response(self, response: str) -> str | None:
        """Parse the LLM response into a label. Return None on failure."""

    def valid_labels(self) -> list[str]:
        """List of all valid output labels."""