Site Reliability Engineering (SRE)
The discipline of operating production systems for reliability — treating operations as a software problem, governed by SLOs (service-level objectives) and practised through incident management, postmortems, runbooks, and anomaly detection. A pillar of platform-ops.
In the sources
google-sre-agentic-ai shows Google SRE folding aiops into the full SRE workflow — runbook generation, TimesFM-based anomaly detection, comms consolidation, postmortem drafting, and incident investigation — while insisting on SLOs + fallbacks, identity/permissions, explainability, and auditability. The mandate there is telling: even AI-augmented SRE keeps the human accountable and the system constrained.
Relation to neighbours
- Investigates over observability signals and the service-topology (netflix-service-topology) to locate a failure’s blast radius.
- Overlaps but does not equal platform-engineering: SRE owns reliability of what runs; platform engineering owns the platform it runs on (kubernetes-integration-tax).