Carbon Tracking Log — PUMA Experiments

PUMA tracks gCO₂eq for every experimental condition using CodeCarbon.

Overview

Once filled in, this log is intended to be, to our knowledge, the first systematic carbon measurement dataset for PM+LLM experiments in the literature.


Why Carbon Matters in PUMA

PUMA is, to our knowledge, the first PM+LLM benchmark to treat carbon measurement as a primary experimental variable rather than an afterthought. This follows the methodology of Strubell et al. (2019) and addresses an identified gap: no existing PM+LLM paper measures environmental cost.

Hypothesis: local quantized models (4-bit, CPU) have a substantially lower carbon footprint than cloud API models, which makes them environmentally preferable whenever their quality is sufficient.


CodeCarbon Setup

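import os

from codecarbon import OfflineEmissionsTracker

# Example condition; in the experiment loop these come from the run configuration.
model, strategy, task = "mistral:7b", "zero-shot", "triage"
run_id = f"{model}-{strategy}-{task}".replace(":", "-")  # ":" is not valid in filenames on every OS

output_dir = "./results/carbon/"
os.makedirs(output_dir, exist_ok=True)  # ensure the output directory exists

# Per-experiment tracking (not global): one tracker and one CSV file per condition.
# country_iso_code is an offline-mode option in CodeCarbon, hence OfflineEmissionsTracker.
tracker = OfflineEmissionsTracker(
    project_name=f"puma-{run_id}",
    output_dir=output_dir,
    output_file=f"emissions_{run_id}.csv",
    log_level="error",
    country_iso_code="ESP",  # Spain: country-level grid mix
    save_to_file=True,
)
tracker.start()
try:
    ...  # inference calls go here
finally:
    emissions_kg = tracker.stop()  # kg CO2eq, or None if tracking failed

# Convert to gCO2eq, keeping None when no measurement is available.
emissions_gco2 = None if emissions_kg is None else emissions_kg * 1000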

Results Log (to be filled during F2–F4)

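| Experiment | Model | Strategy | Task | gCO₂eq | Duration (s) | kWh | Date |
|---|---|---|---|---|---|---|---|
| Pending F2 | llama3.2:8b | zero-shot | triage | TBD | TBD | TBD | |
| Pending F2 | llama3.2:8b | few-shot-3 | triage | TBD | TBD | TBD | |
| Pending F2 | llama3.2:8b | few-shot-6 | triage | TBD | TBD | TBD | |
| Pending F2 | llama3.2:8b | cot | triage | TBD | TBD | TBD | |
| Pending F2 | mistral:7b | zero-shot | triage | TBD | TBD | TBD | |
| Pending F2 | mistral:7b | few-shot-3 | triage | TBD | TBD | TBD | |
| Pending F2 | mistral:7b | few-shot-6 | triage | TBD | TBD | TBD | |
| Pending F2 | mistral:7b | cot | triage | TBD | TBD | TBD | |
| Pending F3 | llama3.2:8b | zero-shot | estimation | TBD | | | |

[…continue…]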
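Once runs are complete, the rows of this log can be filled from the CSV files that CodeCarbon writes. The following is a minimal sketch, assuming those files contain the columns `timestamp`, `project_name`, `duration` (seconds), `emissions` (kg CO₂eq) and `energy_consumed` (kWh); check the header of your own output files, since column sets can differ between CodeCarbon versions. The helper name `to_log_row` is hypothetical and the sample values are made up for illustration, not measured results.

```python
import csv
import io


def to_log_row(record: dict) -> dict:
    """Map one CodeCarbon CSV record to the columns of the results log."""
    return {
        "project": record["project_name"],
        "gco2eq": float(record["emissions"]) * 1000,  # kg -> g
        "duration_s": float(record["duration"]),
        "kwh": float(record["energy_consumed"]),
        "date": record["timestamp"][:10],  # keep only YYYY-MM-DD
    }


# Illustrative CSV text with made-up numbers, NOT measured results.
sample_csv = (
    "timestamp,project_name,duration,emissions,energy_consumed\n"
    "2026-01-01T10:00:00,puma-example,120.0,0.0005,0.0025\n"
)

rows = [to_log_row(r) for r in csv.DictReader(io.StringIO(sample_csv))]
print(rows[0])
```

To read a real file, replace `io.StringIO(sample_csv)` with an open file handle for the relevant CSV in `./results/carbon/`.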

Reference Values (Literature)

| System | Approx. CO₂eq | Source |
|---|---|---|
| GPT-3 training | 552,000 kg | Patterson et al. (2021) |
| Single GPT-4 API call | ~0.002–0.02 kg | Estimates |
| PUMA full experiment (est.) | < 0.050 kg total | Target |
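To put the PUMA target in perspective, the reference values can be compared directly. The sketch below uses only the figures listed above, converted to grams so the arithmetic stays exact; the GPT-4 per-call range is a rough estimate, so the result is an order-of-magnitude indication only.

```python
# Values from the reference values above, converted to grams of CO2eq.
PUMA_BUDGET_G = 50       # < 0.050 kg target for the full experiment
GPT4_CALL_G = (2, 20)    # ~0.002-0.02 kg per API call (rough estimate)

# Number of estimated GPT-4 calls that would emit as much as the whole budget.
calls_if_call_is_costly = PUMA_BUDGET_G / GPT4_CALL_G[1]  # upper per-call estimate
calls_if_call_is_cheap = PUMA_BUDGET_G / GPT4_CALL_G[0]   # lower per-call estimate

print(
    f"Budget equals {calls_if_call_is_costly:g} to "
    f"{calls_if_call_is_cheap:g} estimated GPT-4 calls"
)
```

Under these estimates, the entire PUMA carbon budget corresponds to roughly 2.5 to 25 GPT-4 API calls.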

Analysis Plan (Chapter 4)

  • Carbon per condition (bar chart, sorted by gCO₂eq)
  • Carbon vs F1 tradeoff frontier (scatter plot)
  • Carbon vs MAE tradeoff frontier
  • Total PUMA experiment carbon vs estimated GPT-4 equivalent
  • Recommendation: “minimum carbon condition achieving MVP threshold”
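The tradeoff frontier and the "minimum carbon condition" recommendation from the plan above can be sketched as follows. This is an illustration under stated assumptions, not the Chapter 4 implementation: the names `Condition`, `pareto_frontier` and `min_carbon_meeting` are hypothetical, the F1 threshold of 0.68 is a placeholder for the MVP threshold, and the demo numbers are invented.

```python
from typing import NamedTuple, Optional


class Condition(NamedTuple):
    name: str
    gco2eq: float  # lower is better
    f1: float      # higher is better


def pareto_frontier(conditions: list) -> list:
    """Keep conditions that no other condition beats on both carbon and F1."""
    frontier = []
    for c in conditions:
        dominated = any(
            o.gco2eq <= c.gco2eq and o.f1 >= c.f1
            and (o.gco2eq < c.gco2eq or o.f1 > c.f1)
            for o in conditions
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.gco2eq)


def min_carbon_meeting(conditions: list, f1_threshold: float) -> Optional[Condition]:
    """Return the lowest-carbon condition whose F1 reaches the threshold."""
    eligible = [c for c in conditions if c.f1 >= f1_threshold]
    return min(eligible, key=lambda c: c.gco2eq) if eligible else None


# Placeholder numbers for illustration only, NOT experimental results.
demo = [
    Condition("A", 1.0, 0.60),
    Condition("B", 2.0, 0.70),
    Condition("C", 3.0, 0.65),  # dominated by B: more carbon, lower F1
    Condition("D", 4.0, 0.80),
]
frontier = pareto_frontier(demo)
best = min_carbon_meeting(demo, f1_threshold=0.68)
print([c.name for c in frontier], best.name)
```

For the carbon vs MAE frontier the same logic applies with the quality comparison reversed, since a lower MAE is better.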

PN-KeyConcepts-Agents-Reproducibility-RedTeam (Reproducibility section) | LN-Tools-Dev-Stack
PR-PUMA-Ch4-Results | SP-Architecture (§CodeCarbon layer)
PN-IssueTriage-StoryPoints (tasks tracked) | EX-Hypotheses-H1-H2 (conditions tracked)
MOC-PUMA-Master | MOC-Methods-Frameworks



id: PRISMA-Log
title: "PRISMA Screening Log"
type: log
tags: [slr, prisma, screening, log, literature]
created: 2026-03-01

PRISMA Screening Log

Transparent record of all screening decisions for the PUMA SLR. Every AI-assisted decision is documented (PRISMA-trAIce compliance).


Identification Log

| Date | Database | Search String (abbreviated) | N returned | N forwarded |
|---|---|---|---|---|
| TBD | arXiv | LLM benchmark PM estimation | TBD | TBD |
| TBD | IEEE Xplore | LLM issue triage 2022-2026 | TBD | TBD |
| TBD | ACM DL | LLM story point effort | TBD | TBD |
| TBD | Semantic Scholar (API) | LLM SE benchmark reproducible | TBD | TBD |
| TBD | Google Scholar | Local LLM PM benchmark | TBD | TBD |

Total identified: TBD
After deduplication: TBD
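The step from "Total identified" to "After deduplication" can be sketched as below. This is a minimal illustration, assuming the records from all databases are exported as dictionaries with optional `doi` and `title` fields; the matching rule (DOI first, normalized title as a fallback) is an assumption rather than the procedure actually used in the SLR, and the sample records are invented.

```python
import re


def dedup_key(record: dict) -> str:
    """Prefer the DOI; fall back to a normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()
    return "title:" + title


def deduplicate(records: list) -> list:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up records for illustration only.
records = [
    {"source": "arXiv", "title": "An Example Paper: On LLMs", "doi": ""},
    {"source": "Google Scholar", "title": "An example paper - on LLMs", "doi": None},
    {"source": "IEEE Xplore", "title": "Another Example", "doi": "10.0000/example"},
    {"source": "ACM DL", "title": "Another example (extended)", "doi": "10.0000/EXAMPLE"},
]
unique = deduplicate(records)
print(len(records), "identified ->", len(unique), "after deduplication")
```

One limitation of this rule: a record that carries a DOI will not match a copy of the same paper that lacks one, so a manual pass over near-identical titles is still advisable.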


Screening Decisions (Title/Abstract)

| Zotero Key | Title (abbreviated) | AI suggestion | Human decision | Reason (if excluded) |
|---|---|---|---|---|
| Tawosi2024 | CoGEE: Story Point Estimation | Include | Include | |
| Angermeir2025 | Reproducibility LLM Studies SE | Include | Include | |
| Tawosi2022 | TAWOS Dataset | Include | Include | |
| Ortu2015 | Jira Social Repository | Include | Include | |
| Manzoor2025 | AI in PM 2019-2024 | Include | Include | |

[add all screened papers]

AI-Assisted Screening (PRISMA-trAIce)

| Date | Tool | Task | N processed | N uncertain | Validation done |
|---|---|---|---|---|---|
| TBD | Elicit | Abstract screening batch 1 | TBD | TBD | Human reviewed all uncertain + 20% sample |
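The validation rule "all uncertain + 20% sample" can be made reproducible with a seeded draw. The sketch below assumes the 20% sample is taken from the records the tool did not flag as uncertain, which is one reading of the rule; the function name `validation_set`, the seed and the paper keys are hypothetical.

```python
import math
import random


def validation_set(keys, uncertain, fraction=0.20, seed=0):
    """Return (all uncertain keys, random sample of the remaining keys)."""
    uncertain = set(uncertain)
    confident = sorted(k for k in keys if k not in uncertain)
    n_sample = math.ceil(fraction * len(confident))
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn later
    sample = rng.sample(confident, n_sample)
    return sorted(uncertain), sorted(sample)


# Hypothetical batch of 20 screened records, two of them flagged uncertain.
keys = [f"paper{i:02d}" for i in range(1, 21)]
uncertain = ["paper03", "paper11"]

must_review, spot_check = validation_set(keys, uncertain)
print(len(must_review), "uncertain +", len(spot_check), "sampled for human review")
```

Recording the seed alongside the batch in this log lets a reviewer confirm which records were spot-checked.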

Eligibility Decisions (Full Text)

| Zotero Key | Decision | Reason |
|---|---|---|
| Tawosi2024 | Include | Full paper verified, TAWOS dataset, GPT-4, MAE reported |
| Angermeir2025 | Include | Full paper verified, 85 papers meta-study, ICSE 2026 |

[add all]

Final PRISMA Counts

| Stage | N |
|---|---|
| Identified | TBD |
| After deduplication | TBD |
| Screened (title/abstract) | TBD |
| Excluded at screening | TBD |
| Full-text reviewed | TBD |
| Excluded at full-text | TBD |
| Included | TBD (target ≥40) |
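Before the final counts are reported, it helps to check that they add up across stages. The sketch below is a hypothetical consistency check, assuming that every deduplicated record is screened and that no records enter or leave the flow by other routes; the dictionary keys and the example numbers are placeholders, not PUMA results.

```python
def check_prisma_counts(c: dict) -> list:
    """Return a list of inconsistencies; an empty list means the flow adds up."""
    problems = []
    if c["after_dedup"] > c["identified"]:
        problems.append("more records after deduplication than identified")
    if c["screened"] != c["after_dedup"]:
        problems.append("screened should equal after_dedup")
    if c["full_text"] != c["screened"] - c["excluded_screening"]:
        problems.append("full_text should equal screened - excluded_screening")
    if c["included"] != c["full_text"] - c["excluded_full_text"]:
        problems.append("included should equal full_text - excluded_full_text")
    return problems


# Placeholder counts for illustration only.
counts = {
    "identified": 100,
    "after_dedup": 80,
    "screened": 80,
    "excluded_screening": 50,
    "full_text": 30,
    "excluded_full_text": 10,
    "included": 20,
}
problems = check_prisma_counts(counts)
print(problems or "PRISMA counts are consistent")
```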