LN: Chen et al. (2024) — Automatic Root Cause Analysis via LLMs for Cloud Incidents

Bibliographic Reference

Citation: Chen, Y., Xie, H., Ma, M., et al. (2024). Automatic root cause analysis via large language models for cloud incidents. EuroSys 2024. arXiv:2305.15778. https://arxiv.org/abs/2305.15778

Overview

Important Note

The bibliography originally listed arXiv:2407.11638 as the ID, which points to a different paper; the correct arXiv ID is 2305.15778. The title reads “Automatic,” not “Automated.” DOI: 10.1145/3627703.3629553 (EuroSys 2024).

Affiliation: Microsoft


Pass 1 — Bird’s Eye View (5 Cs)

Category: System proposal + industrial deployment
Context: Microsoft’s production system for cloud incident management
Correctness: Deployed in production at Microsoft. Evaluated on real incidents.
Contributions: (1) LLM-based automated root cause analysis for cloud incidents; (2) Significant reduction in MTTD; (3) Multi-step reasoning over telemetry data; (4) Production deployment evidence
Clarity: Good. Industrial paper with real metrics.

Relevance: ⭐⭐⭐

PUMA’s AIOps context (incident triage, risk detection) is directly related. This paper shows that LLMs can automate production-grade incident analysis. Use it as a reference for PUMA’s justification of the value of AI-automated triage (Section 1.1).


PUMA Connection

Supports the claim that LLM-based automated triage has production-grade value (not just research-grade). Microsoft’s deployment validates the practical relevance of PUMA’s triage module.

MOCs