LN: Strubell, Ganesh & McCallum (2019) — Energy and Policy Considerations for Deep Learning in NLP

Bibliographic Reference

Citation: Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3645–3650). https://doi.org/10.18653/v1/P19-1355


Pass 1 — Bird’s Eye View (5 Cs)

| C | Assessment |
|---|---|
| Category | Empirical study + policy argument |
| Context | University of Massachusetts Amherst, ACL 2019. First systematic measurement of the environmental cost of NLP model training. |
| Correctness | Empirically measured: power draw was sampled during training (GPU via nvidia-smi, CPU via Intel RAPL), and cloud prices (AWS, Google Cloud) were used for the cost estimates. Results corroborated by subsequent independent studies. |
| Contributions | (1) Quantified CO₂ cost of training large NLP models (BERT, Transformer-NAS: up to 626,155 lbs CO₂); (2) Comparison with equivalent car/flight emissions; (3) Policy recommendations for the NLP research community; (4) Methodology for measuring ML energy consumption |
| Clarity | Excellent: concrete numbers, clear methodology, provocative framing |

Relevance: ⭐⭐⭐⭐⭐

Strubell et al. provide the academic justification for PUMA’s carbon footprint measurement (CodeCarbon integration). PUMA’s sustainability reporting methodology is directly traceable to this paper.


Pass 2 — Key Concepts

The Carbon Cost of NLP Training

Key findings (2019 figures):

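| Model | CO₂ eq (lbs) | Equivalent to |
|---|---|---|
| Transformer (base) | 26 | ~1% of one NY–SF flight |
| GPT-2 | not reported | no CO₂ estimate given (trained on TPUs) |
| Transformer-NAS (neural arch search) | 626,155 | ~5× lifetime car emissions |
| BERT (base) training | 1,438 | ~0.7 of one NY–SF flight |

Benchmarks from the paper: one passenger flight NY↔SF = 1,984 lbs CO₂e; average car lifetime including fuel = 126,000 lbs CO₂e.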
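The equivalences can be checked with simple division. The sketch below assumes the two reference values from Table 1 of the paper (1,984 lbs CO₂e for one passenger flying NY↔SF, 126,000 lbs CO₂e for an average car lifetime including fuel). The function name `equivalents` is illustrative and not taken from the paper.

```python
# Reference emissions from Table 1 of Strubell et al. (2019), in lbs CO2e.
FLIGHT_NY_SF_LBS = 1984       # air travel, one passenger, NY <-> SF
CAR_LIFETIME_LBS = 126_000    # average car including fuel, one lifetime
LBS_PER_KG = 2.20462          # unit conversion


def equivalents(co2_lbs):
    """Express an emissions figure in kg, flights, and car lifetimes."""
    return {
        "kg": co2_lbs / LBS_PER_KG,
        "flights": co2_lbs / FLIGHT_NY_SF_LBS,
        "car_lifetimes": co2_lbs / CAR_LIFETIME_LBS,
    }


nas = equivalents(626_155)  # Transformer with neural architecture search
print(round(nas["car_lifetimes"], 2))  # 4.97
```

Applying the same function to any other row converts its lbs figure into the same three units.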

The CO₂ Measurement Methodology

Strubell et al.’s approach (basis for CodeCarbon):

$$\mathrm{CO_2eq} = E \times CI$$

Where:

  • $E$ = energy consumed (kWh) = Power draw × Duration
  • $CI$ = carbon intensity of electricity grid (kg CO₂/kWh)

This two-factor model is implemented in CodeCarbon with the extension:

$$\mathrm{CO_2eq} = E \times \mathrm{PUE} \times CI$$

Where PUE (Power Usage Effectiveness) accounts for data centre overhead.
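As a minimal sketch of this calculation (not CodeCarbon's actual implementation), the function below multiplies energy, PUE, and carbon intensity. The 300 W power draw, 10 hour duration, and 0.4 kg CO₂/kWh grid intensity are hypothetical inputs chosen for illustration. The PUE of 1.58 is the coefficient Strubell et al. apply in their own power estimate.

```python
def co2eq_kg(power_watts, hours, carbon_intensity_kg_per_kwh, pue=1.0):
    """Estimate emissions as energy (kWh) x PUE x grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours  # power draw x duration
    return energy_kwh * pue * carbon_intensity_kg_per_kwh


# Hypothetical workload: 300 W for 10 hours on a 0.4 kg CO2/kWh grid.
baseline = co2eq_kg(300, 10, 0.4)                 # two-factor model only
with_overhead = co2eq_kg(300, 10, 0.4, pue=1.58)  # adds data centre overhead
print(f"{baseline:.3f} kg, {with_overhead:.3f} kg")
```

Leaving `pue` at its default of 1.0 reproduces the two-factor model, so the same function covers both forms.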

Policy Recommendations

Strubell et al. make three policy recommendations:

  1. Reporting standards: NLP papers should report training cost alongside performance metrics
  2. Equitable access: High compute cost creates barriers for researchers without industry resources
  3. Efficiency incentives: Research community should prioritise efficient models, not just maximally accurate ones

These recommendations directly motivated the ML sustainability movement (Green AI, SustaiNLP workshops).

Inference vs. Training Cost

A critical distinction for applying the paper’s figures:

  • Training is the cost the paper quantifies (up to 626,155 lbs CO₂ for NAS)
  • A single inference pass is orders of magnitude cheaper than a training run (PUMA uses pre-trained models, so it incurs inference cost only)

PUMA’s carbon footprint comes entirely from inference — running already-trained models on 200–1000 issues. This is at the milligram CO₂ scale, not tonne scale. However, measuring it demonstrates scientific rigour and establishes baselines for production SmartPMO deployment.
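To give a feel for the scale involved, the sketch below estimates batch inference emissions in milligrams. Every input is a hypothetical placeholder, not a PUMA measurement: 0.2 s per issue, 30 W average power draw, and a grid intensity of 0.4 kg CO₂/kWh. The function name `inference_co2_mg` is likewise illustrative.

```python
def inference_co2_mg(n_issues, seconds_per_issue, avg_power_watts,
                     carbon_intensity_kg_per_kwh, pue=1.0):
    """Estimated CO2eq in milligrams for a batch of inference calls."""
    hours = n_issues * seconds_per_issue / 3600
    energy_kwh = avg_power_watts / 1000 * hours
    kg = energy_kwh * pue * carbon_intensity_kg_per_kwh
    return kg * 1_000_000  # kg -> mg


# Hypothetical inputs only; replace with measured values before reporting.
for n in (200, 1000):
    print(n, "issues:", round(inference_co2_mg(n, 0.2, 30, 0.4), 1), "mg")
```

With measured duration and power figures substituted for the placeholders, the same arithmetic gives a baseline that can be compared across runs.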


PUMA Integration

  • Ch.3 Methods / Sustainability subsection: Strubell et al. as the methodological basis for CodeCarbon integration
  • CO₂eq formula: Directly from this paper (extended with PUE in PN-ComputationalSustainability)
  • Framing: PUMA measures inference cost, not training cost — proportionately tiny, but establishes methodology for production deployment

MOCs