LN: Strubell, Ganesh & McCallum (2019) — Energy and Policy Considerations for Deep Learning in NLP
Bibliographic Reference
Citation: Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3645–3650). https://doi.org/10.18653/v1/P19-1355
Pass 1 — Bird’s Eye View (5 Cs)
| C | Assessment |
|---|---|
| Category | Empirical study + policy argument |
| Context | University of Massachusetts Amherst, ACL 2019. Among the first systematic estimates of the environmental cost of NLP model training |
| Correctness | Power draw sampled empirically during training (GPU via nvidia-smi, CPU via Intel RAPL) and scaled by reported training times; cloud compute cost estimated from published pricing. Results corroborated by subsequent independent studies |
| Contributions | (1) Quantified CO₂ cost of training large NLP models (BERT, Transformer-NAS: up to 626,155 lbs CO₂); (2) Comparison with equivalent car/flight emissions; (3) Policy recommendations for NLP research community; (4) Methodology for measuring ML energy consumption |
| Clarity | Excellent — concrete numbers, clear methodology, provocative framing |
Relevance: ⭐⭐⭐⭐⭐
Strubell et al. provides the academic justification for PUMA’s carbon footprint measurement (CodeCarbon integration). PUMA’s sustainability reporting methodology is directly traceable to this paper.
Pass 2 — Key Concepts
The Carbon Cost of NLP Training
Key findings (2019 figures):
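| Model | CO₂e (lbs) | Equivalent to |
|---|---|---|
| Transformer (base) | 26 | ~1% of one NY–SF flight (1,984 lbs per passenger) |
| GPT-2 | not reported | n/a (the paper gives only cloud cost for this TPU-trained model) |
| Transformer-NAS (neural arch search) | 626,155 | ~5× lifetime car emissions (126,000 lbs per car, incl. fuel) |
| BERT (base) training | 1,438 | ~0.7 of one NY–SF flight; the paper describes it as roughly a trans-American flight |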
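The equivalences are plain ratios against the paper's reference figures: 1,984 lbs CO₂e for one passenger's NY–SF air travel and 126,000 lbs for an average car's lifetime including fuel. A minimal sketch of that arithmetic follows; the function name and the returned dictionary layout are illustrative choices, not anything defined in the paper.

```python
# Reference figures reported in the paper (lbs CO2e).
FLIGHT_NY_SF_LBS = 1_984     # air travel, one passenger, NY-SF
CAR_LIFETIME_LBS = 126_000   # average car including fuel, one lifetime


def equivalents(co2e_lbs):
    """Express an emission figure as multiples of the two reference quantities."""
    return {
        "flights": co2e_lbs / FLIGHT_NY_SF_LBS,
        "car_lifetimes": co2e_lbs / CAR_LIFETIME_LBS,
    }


# Transformer with neural architecture search: 626,155 lbs CO2e in the paper.
nas = equivalents(626_155)
print(f"NAS: {nas['car_lifetimes']:.1f} car lifetimes, {nas['flights']:.0f} flights")
```

Running the same function on any other row gives its equivalents, which is a quick way to sanity-check the "Equivalent to" column.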
The CO₂ Measurement Methodology
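Strubell et al.’s approach (basis for CodeCarbon):

$$\mathrm{CO_2eq} = E \times I$$

Where:
- $E$ = energy consumed (kWh) = power draw × duration
- $I$ = carbon intensity of the electricity grid (kg CO₂/kWh)

This two-factor model is implemented in CodeCarbon with the extension:

$$\mathrm{CO_2eq} = E \times \mathrm{PUE} \times I$$

Where PUE (Power Usage Effectiveness) accounts for data centre overhead. The paper's own estimate already applies a PUE of 1.58 and converts energy to emissions at the U.S. average of 0.954 lbs CO₂e per kWh.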
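To make the arithmetic concrete, the sketch below computes emissions as energy times grid carbon intensity, optionally scaled by PUE, and also encodes the paper's own power estimate: total kWh = 1.58 × hours × (CPU W + DRAM W + GPUs × GPU W) / 1000, then CO₂e (lbs) = 0.954 × kWh. The function names and the hardware wattages in the example are hypothetical and chosen only for illustration; this is not CodeCarbon's internal implementation.

```python
def co2eq_kg(energy_kwh, intensity_kg_per_kwh, pue=1.0):
    """Two-factor model (energy x grid intensity), optionally scaled by PUE."""
    return energy_kwh * pue * intensity_kg_per_kwh


def strubell_total_kwh(hours, cpu_w, dram_w, gpu_w, n_gpus, pue=1.58):
    """Total energy in kWh as estimated in the paper.

    Average power draws are in watts; 1.58 is the PUE coefficient the
    paper applies for data centre overhead.
    """
    return pue * hours * (cpu_w + dram_w + n_gpus * gpu_w) / 1000.0


def strubell_co2e_lbs(total_kwh, lbs_per_kwh=0.954):
    """Convert energy to lbs CO2e using the U.S. average factor from the paper."""
    return lbs_per_kwh * total_kwh


# Hypothetical run: 10 hours on 4 GPUs drawing 250 W each,
# plus 100 W CPU and 20 W DRAM (illustrative numbers, not measurements).
kwh = strubell_total_kwh(hours=10, cpu_w=100, dram_w=20, gpu_w=250, n_gpus=4)
print(f"{kwh:.3f} kWh, {strubell_co2e_lbs(kwh):.2f} lbs CO2e")
```

Keeping PUE and grid intensity as explicit parameters makes it easy to substitute region-specific values instead of the paper's U.S. averages.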
Policy Recommendations
Strubell et al. make three policy recommendations:
- Reporting standards: NLP papers should report training time and hyperparameter sensitivity alongside performance metrics
- Equitable access: High compute cost creates barriers for researchers without industry resources; academic researchers need equitable access to computation
- Efficiency incentives: The research community should prioritise computationally efficient hardware and algorithms, not just maximally accurate models
These recommendations helped motivate later sustainability work in ML, including Green AI and the SustaiNLP workshops.
Inference vs. Training Cost
A distinction that matters for PUMA (the paper itself quantifies training and development cost):
- Training is where the paper's headline figures arise (626,155 lbs CO₂e for NAS)
- A single inference run is orders of magnitude cheaper (PUMA uses pre-trained models, so inference only)
PUMA’s carbon footprint comes entirely from inference — running already-trained models on 200–1000 issues. This is at the milligram CO₂ scale, not tonne scale. However, measuring it demonstrates scientific rigour and establishes baselines for production SmartPMO deployment.
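A rough way to see why inference lands at this scale is to multiply average power draw, time per issue, and grid intensity. The sketch below does exactly that; every input value in the example (50 W, 0.5 s per issue, 0.4 kg CO₂/kWh) is a hypothetical placeholder and not a PUMA measurement, and the function name is illustrative.

```python
def inference_co2_mg(power_w, seconds_per_issue, n_issues,
                     intensity_kg_per_kwh, pue=1.0):
    """Estimate total inference emissions in milligrams of CO2."""
    # Watt-seconds (joules) to kWh: divide by 3.6 million.
    energy_kwh = power_w * seconds_per_issue * n_issues / 3_600_000
    # kg to mg: multiply by one million.
    return energy_kwh * pue * intensity_kg_per_kwh * 1_000_000


# Hypothetical inputs: 50 W average draw, 0.5 s per issue, 1000 issues,
# grid intensity of 0.4 kg CO2/kWh.
total_mg = inference_co2_mg(50, 0.5, 1000, 0.4)
print(f"total: {total_mg:.0f} mg, per issue: {total_mg / 1000:.2f} mg")
```

Under these assumed inputs the per-issue figure is a few milligrams; measured values from CodeCarbon would replace the placeholders in practice.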
PUMA Integration
- Ch.3 Methods / Sustainability subsection: Strubell et al. as the methodological basis for CodeCarbon integration
- CO₂eq formula: Directly from this paper (extended with PUE in PN-ComputationalSustainability)
- Framing: PUMA measures inference cost, not training cost — proportionately tiny, but establishes methodology for production deployment
Related Notes
- PN-ComputationalSustainability — full CodeCarbon integration, CO₂eq formula, hardware baselines
- Ethics-Review-Log — sustainability as an ethics consideration in PUMA
- PN-LLM-Models-PUMA — model parameter counts that determine inference cost